Abstract
With the growing integration of distributed energy resources (DERs), flexible loads, and other emerging technologies, modern power and energy systems face increasing complexities and uncertainties, which pose great challenges to operation and control. Meanwhile, the deployment of advanced sensors and smart meters generates a large amount of data, which brings opportunities for novel data-driven methods to deal with complicated operation and control issues. Among them, reinforcement learning (RL) is one of the most widely promoted methods for control and optimization problems. This paper provides a comprehensive literature review of RL in terms of basic ideas, various types of algorithms, and their applications in power and energy systems. The challenges and further works are also discussed.
With the gradual depletion of fossil energy and increasing environmental pressure, a revolution in the energy sector is underway globally [
Various approaches have been proposed for the optimization and control of modern power and energy systems. In general, the optimization methods can be broadly classified into classical algorithms [
Different from the methods mentioned above, RL is a class of methods inspired by behavioral psychology. RL can extract optimal operational knowledge from historical data through continuous interactions with the environment even when the global optimum is unknown. It can also remove the dependency on an accurate physical model by learning a surrogate model [
1) Typical RL, deep RL (DRL), and multi-agent DRL (MADRL) for optimization and control of modern power and energy systems are summarized thoroughly with the detailed analysis of advantages and disadvantages.
2) State-of-the-art applications of RL algorithms in power and energy systems are organized into several categories.
3) A comprehensive analysis of the limitations of current RL algorithms is presented.
The structure of this paper is as follows. Section II introduces the RL algorithms. A comprehensive review of RL for power system applications is presented in Section III. Section IV discusses the challenges and prospects of RL in power systems and concludes this paper.
In this section, the Markov decision process (MDP) is first illustrated, followed by the classical RL, advanced DRL, and MADRL algorithms.
ML algorithms can be classified into three categories: unsupervised learning, supervised learning, and RL. Unsupervised learning typically includes clustering, dimensionality reduction, and association rule learning methods. Supervised learning, which typically acts as a function approximator, aims to build a mapping rule from the training inputs to the labeled outputs utilizing a predefined evaluation index [
Compared with supervised learning and unsupervised learning, RL is regarded as active learning. The basic structure of RL is shown in Fig. 1.

Fig. 1 Framework of RL algorithm.
There are mainly two components: the agent and the environment [

Fig. 2 Main RL algorithms and their relationships.
In the framework of RL, the interaction between agent and environment is formalized by MDP [
1) Action: $\mathcal{A}$ is the action set, and $a \in \mathcal{A}$ is a specific action.
2) State: $\mathcal{S}$ is a finite state set, and $s \in \mathcal{S}$ is a given state.
3) Transition model $P(s_{t+1} \mid s_t, a_t)$: the transition model determines the probability of the next-step state $s_{t+1}$ given the current state $s_t$ and action $a_t$.
4) Reward function $r(s_t, a_t)$: the immediate reward obtained by the agent when taking action $a_t$ under state $s_t$.
5) Discount factor $\gamma$: the discount factor $\gamma \in [0, 1]$ is used to balance the importance of immediate rewards relative to future rewards.
6) Policy $\pi$: a policy mapping from states to actions is yielded when solving an MDP. An optimal policy $\pi^{*}$ means that the maximum expected discounted cumulative reward can be obtained.
The illustration of MDP is shown in Fig. 3.

Fig. 3 Illustration of MDP.
At each epoch, the environment takes the current state $s_t$ and action $a_t$ as the input, and the output is the current reward $r_t$ and the state of the next step $s_{t+1}$. The quality of action $a_t$ under state $s_t$ is measured by the expected cumulative discounted reward obtained by the agent from the current time step onward:
$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ R_t \mid s_t, a_t \right], \quad R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$   (1)
where $\mathbb{E}[\cdot]$ is the expectation of the cumulative discounted reward; $R_t$ is the cumulative reward obtained by the agent from time step $t$ onward; and $Q^{\pi}(s_t, a_t)$ is the so-called action-value function. The RL algorithm aims to look for an optimal policy $\pi^{*}$ so as to maximize the action-value function.
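To make the notation concrete, the following minimal Python sketch builds a toy two-state MDP (all states, actions, transition probabilities, and rewards are hypothetical values chosen purely for illustration) and estimates the discounted cumulative reward of a fixed policy by rolling it out against the transition model, i.e., a Monte-Carlo estimate of the quantity whose expectation defines the action-value function in (1).

```python
# Hypothetical two-state MDP: the states, actions, probabilities, and rewards below
# are illustrative placeholders, not a real power system model.
import random

states = ["low_load", "high_load"]            # finite state set S
actions = ["charge", "discharge"]             # action set A
gamma = 0.95                                  # discount factor

# Transition model P(s' | s, a)
P = {
    ("low_load", "charge"):     {"low_load": 0.7, "high_load": 0.3},
    ("low_load", "discharge"):  {"low_load": 0.9, "high_load": 0.1},
    ("high_load", "charge"):    {"low_load": 0.2, "high_load": 0.8},
    ("high_load", "discharge"): {"low_load": 0.6, "high_load": 0.4},
}

# Reward function r(s, a)
R = {
    ("low_load", "charge"): 1.0,  ("low_load", "discharge"): 0.0,
    ("high_load", "charge"): -1.0, ("high_load", "discharge"): 2.0,
}

def step(state, action):
    """Environment step: sample the next state and return the immediate reward."""
    probs = P[(state, action)]
    next_state = random.choices(list(probs), weights=list(probs.values()))[0]
    return next_state, R[(state, action)]

def rollout(policy, s0, horizon=100):
    """Accumulate the discounted cumulative reward of a policy starting from s0."""
    s, ret = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        s, r = step(s, a)
        ret += (gamma ** t) * r
    return ret

# Fixed policy: discharge when the load is high, otherwise charge
print(rollout(lambda s: "discharge" if s == "high_load" else "charge", "low_load"))
```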
Considering that the future system information is unknown, it is intractable for the agent to determine the optimal policy $\pi^{*}$ directly. Thus, an iterative update of the action-value function based on the Bellman equation is adopted by the Q-learning algorithm [
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$   (2)
where $\alpha$ is the learning rate. With iteration, the Q-value will converge to the optimal value $Q^{*}(s_t, a_t)$. Then, the optimal control schedule can be obtained based on a greedy strategy:
$a_t^{*} = \arg\max_{a} Q^{*}(s_t, a)$   (3)
The original Q-learning algorithm stores the action values in a discretized lookup table, the size of which is determined by the dimensions of the states and actions. However, multivariate continuous state and action variables are typically needed in practical applications of power and energy systems. The discretization of the state and action variables not only leads to a sharp increase in computational complexity, but also discards valuable information about the structure of the state and action domains that is essential for solving problems.
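As a minimal illustration of the lookup-table form of Q-learning, the sketch below applies the iterative update in (2) and extracts the greedy strategy in (3) on a hypothetical two-state, two-action environment; the toy dynamics, learning rate, and exploration rate are assumptions made purely for illustration.

```python
# Tabular Q-learning sketch on a toy environment (all dynamics and rewards are hypothetical).
import random
from collections import defaultdict

actions = [0, 1]

def step(s, a):
    """Toy dynamics: two states, a small penalty per step, a bonus for action 0 in state 1."""
    s_next = (s + a) % 2
    r = 1.0 if (s == 1 and a == 0) else -0.1
    return s_next, r

Q = defaultdict(float)                         # lookup table Q[(state, action)]
alpha, gamma, epsilon = 0.1, 0.95, 0.1         # learning rate, discount factor, exploration rate

s = 0
for _ in range(10000):
    # epsilon-greedy action selection
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda x: Q[(s, x)])
    s_next, r = step(s, a)
    # iterative update of the action-value function, as in (2)
    td_target = r + gamma * max(Q[(s_next, x)] for x in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    s = s_next

# greedy control schedule derived from the learned Q-table, as in (3)
print({state: max(actions, key=lambda x: Q[(state, x)]) for state in (0, 1)})
```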
Traditional RL algorithms have several limitations. Firstly, they suffer from the “curse of dimensionality” when coping with scenarios with high-dimensional and continuous state and action spaces. Secondly, hand-specified state representations are typically required. As a function approximator, the deep neural network (DNN) can be applied to address the above limitations by approximating the state-action function with the parameters of a neural network (NN). Combining the DNN and the RL algorithm has two advantages: ① the strong feature extraction ability of the DNN avoids the manual feature design process, and the control decisions can be derived directly from the raw inputs through an end-to-end learning procedure; ② the DNN helps RL generalize to problems with a large state space [
One of the breakthroughs in DRL is the value-based deep Q-network (DQN) algorithm, which uses a DNN as the function approximator to fit the action-value function. The structure of the DQN algorithm is shown in Fig. 4.

Fig. 4 Structure of DQN algorithm.
The DQN algorithm adopts a replay buffer to store a large number of transitions $(s_t, a_t, r_t, s_{t+1})$. The experience replay mechanism helps break the correlation among training data by randomly sampling a mini-batch of data from the memory when updating the NN. DQN also introduces a target Q network to alleviate the non-stationary distribution of training data, significantly improving the stability of the training process. At each time step, the parameters $\theta$ of the action-value function are optimized by minimizing the following loss function [
$L(\theta) = \mathbb{E}\left[ \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \right)^{2} \right]$   (4)
where $\theta$ and $\theta^{-}$ are the parameters of the action-value network and the target Q network, respectively. DQN has several improved versions to reduce overestimation, such as double DQN [
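The sketch below illustrates how the replay buffer, the target Q network, and the loss in (4) fit together in one DQN update step. It is a hedged example rather than a reference implementation: the network sizes, the randomly generated transitions, and the use of PyTorch are illustrative assumptions.

```python
# Minimal DQN update sketch; dimensions and transitions are placeholder assumptions.
import random
from collections import deque
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())            # target Q network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=10000)                               # replay buffer of (s, a, r, s') tuples
for _ in range(1000):                                      # filled here with random toy transitions
    replay.append((torch.randn(state_dim), random.randrange(n_actions),
                   random.random(), torch.randn(state_dim)))

def dqn_update(batch_size=32):
    batch = random.sample(replay, batch_size)              # random sampling breaks data correlation
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s2 = torch.stack([b[3] for b in batch])
    with torch.no_grad():                                   # TD target uses the frozen target network
        y = r + gamma * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    loss = nn.functional.mse_loss(q, y)                     # loss function in (4)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

for step_i in range(200):
    dqn_update()
    if step_i % 50 == 0:                                    # periodic synchronization of the target network
        target_net.load_state_dict(q_net.state_dict())
```

Keeping the target network frozen between periodic synchronizations prevents the regression target from shifting at every update, which is what stabilizes the training.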
The DNN utilized in DQN avoids the discretization of the state space. However, since DQN relies on finding the action that maximizes the action-value function, it still needs to discretize the action domain for applications with continuous action variables. The discretization of the action domain may lead to the curse-of-dimensionality issue since the total number of actions increases exponentially with the number of action types. Moreover, the discretization of the action space may cause information loss and lead to sub-optimal solutions. This makes it intractable to apply DQN-based methods to applications with high-dimensional and continuous action spaces.
The policy gradient algorithm is suitable for tasks with continuous and high-dimensional action spaces. Instead of learning the action-value function, the policy gradient algorithm directly learns a mapping from the observed state to the control decision. It maintains a policy function $\pi_{\theta}$ parameterized by the weights $\theta$, and aims to maximize the expected cumulative reward by optimizing $\theta$. Specifically, the parameters are optimized via the gradient:
$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) R(\tau) \right]$   (5)
where $\tau$ is a trajectory; $T$ is the episode length; and $R(\tau)$ is the cumulative reward of the trajectory. The parameters of the NN are optimized towards the direction that increases the probability of trajectories with larger rewards. The variance of the gradient is high in the policy gradient algorithm. To this end, a baseline term is typically subtracted from $R(\tau)$.
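A compact sketch of the gradient estimate in (5), with a running-average baseline subtracted from the trajectory return to reduce variance, is given below; the toy dynamics, reward, dimensions, and the use of PyTorch are illustrative assumptions.

```python
# REINFORCE-style policy gradient sketch with a baseline; the task is a toy placeholder.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def sample_trajectory(T=20):
    """Roll out the stochastic policy on toy dynamics; collect log-probabilities and rewards."""
    s = torch.randn(state_dim)
    log_probs, rewards = [], []
    for _ in range(T):
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        rewards.append(float(a) - 0.5 * s[0].item())         # toy reward signal
        s = torch.randn(state_dim)                            # toy next state
    return log_probs, rewards

baseline = 0.0
for episode in range(500):
    log_probs, rewards = sample_trajectory()
    ret = sum((gamma ** t) * r for t, r in enumerate(rewards))   # cumulative reward R(tau)
    baseline = 0.95 * baseline + 0.05 * ret                   # running-average baseline
    # ascend the gradient of E[sum_t log pi(a_t|s_t) (R(tau) - b)], cf. (5)
    loss = -(ret - baseline) * torch.stack(log_probs).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```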
The baseline term in the policy gradient algorithms is typically replaced by the value function learned by the critic. This corresponds to the actor-critic algorithm, which is a subset of the policy gradient algorithms. The basic structure of actor-critic-based algorithms is shown in Fig. 5.

Fig. 5 Structure of actor-critic-based algorithms.
Deep deterministic policy gradient (DDPG) is an actor-critic-based algorithm. It employs two functions for different purposes: the actor function learns the control policy, and the critic function provides the judgement of the actor. The actor and critic are trained against each other so that the actor can learn a better control strategy and the critic can provide a more accurate judgement. DDPG also introduces the experience replay mechanism and target networks to stabilize the training. The parameters of the critic network are optimized by minimizing the following loss function [
$L(\theta^{Q}) = \frac{1}{N} \sum_{i=1}^{N} \left( r_i + \gamma Q'\left(s_{i+1}, \mu'(s_{i+1}; \theta^{\mu'}); \theta^{Q'}\right) - Q(s_i, a_i; \theta^{Q}) \right)^{2}$   (6)
where $N$ is the number of samples in one batch; $Q'$ and $\mu'$ are the target critic and actor networks, respectively; and $\theta^{Q'}$ and $\theta^{\mu'}$ are the parameters of the target critic and actor networks, respectively. The parameters $\theta^{\mu}$ of the actor network are optimized according to the following deterministic policy gradient [
$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{a} Q(s, a; \theta^{Q}) \big|_{s = s_i, a = \mu(s_i)} \nabla_{\theta^{\mu}} \mu(s; \theta^{\mu}) \big|_{s = s_i}$   (7)
The parameters of the target networks are updated via the soft update mechanism to alleviate the non-stationary distribution of training data.
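The sketch below ties together the critic loss in (6), the deterministic policy gradient in (7), and the soft target update on a single randomly generated mini-batch; the network sizes, learning rates, and data are illustrative assumptions rather than a tuned implementation.

```python
# DDPG update sketch; all dimensions and the mini-batch are placeholder assumptions.
import torch
import torch.nn as nn

state_dim, action_dim, gamma, tau = 6, 2, 0.99, 0.005
actor      = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic     = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_tgt  = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic_tgt = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_tgt.load_state_dict(actor.state_dict()); critic_tgt.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2):
    # Critic loss (6): regress Q(s, a) onto the target y = r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = r + gamma * critic_tgt(torch.cat([s2, actor_tgt(s2)], dim=1)).squeeze(1)
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic policy gradient (7): move the actor towards actions the critic rates highly
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks
    for net, tgt in ((actor, actor_tgt), (critic, critic_tgt)):
        for p, p_tgt in zip(net.parameters(), tgt.parameters()):
            p_tgt.data.mul_(1 - tau).add_(tau * p.data)

batch = 32
ddpg_update(torch.randn(batch, state_dim), torch.rand(batch, action_dim),
            torch.rand(batch), torch.randn(batch, state_dim))
```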
Different from the experience replay mechanism used in DQN and DDPG, the asynchronous advantage actor-critic (A3C) algorithm employs multiple parallelized workers to break the correlations among the training data and stabilize the training. The gradients are first calculated by multiple local actors, and then passed to the global NN to perform the optimization. An entropy term is also added to the loss function to improve exploration and help convergence to a better policy. The parameters of the policy function are optimized by [
$\nabla_{\theta'} J = \sum_{t=0}^{T} \left[ \nabla_{\theta'} \log \pi(a_t \mid s_t; \theta') A(s_t, a_t; \theta_v) + \beta \nabla_{\theta'} H\left(\pi(s_t; \theta')\right) \right]$   (8)
where $A(s_t, a_t; \theta_v)$ is the advantage function; $H(\cdot)$ is the entropy function; $\beta$ is the weight of the entropy term; and $T$ is the horizon of time steps. Soft actor-critic (SAC) is also an entropy-regularization-based DRL algorithm. It adopts a value function and two action-value functions, and alternates between updating them using sampled batches from the memory and collecting experiences following the current policy [
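As a small illustration of the entropy regularization idea shared by A3C and SAC, the snippet below adds a weighted entropy bonus of the current policy to an actor loss, mirroring the second term in (8); the logits, advantage values, and the weight $\beta$ are placeholder assumptions.

```python
# Entropy-regularized actor loss sketch; all inputs are placeholder values.
import torch
import torch.nn as nn

policy_head = nn.Linear(4, 3)                         # toy policy producing logits over 3 actions
logits = policy_head(torch.randn(8, 4))               # logits for a batch of 8 toy states
dist = torch.distributions.Categorical(logits=logits)
advantage = torch.randn(8)                            # placeholder advantage estimates A(s, a)
actions = dist.sample()
beta = 0.01                                           # weight of the entropy term

# loss = -(log pi * advantage) - beta * entropy, i.e., gradient ascent on (8)
loss = -(dist.log_prob(actions) * advantage).mean() - beta * dist.entropy().mean()
loss.backward()                                       # gradients flow into the policy parameters
```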
Classical policy gradient algorithms also include trust region policy optimization (TRPO) [
The algorithms mentioned above only use a single agent. However, many applications involve interactions among multiple agents, such as multiplayer games and multi-robot control problems. The application of single-agent DRL algorithms to multi-agent environments yields poor performance, as the environment can become non-stationary from the point of view of each individual agent. This prevents the direct use of the experience replay mechanism and brings stability challenges during training. The policy gradient algorithms suffer from high variance when coordination among agents is required. The details of MADRL are elaborated as follows.
A Markov game is a multi-agent extension of the MDP. It consists of four components: a state set $S$, action sets $A_1, \ldots, A_N$ for all the agents, a transition function $P$, and reward functions $r_1, \ldots, r_N$ for all agents. Each agent $i$ chooses actions according to its local observations, and then obtains a reward that is a function of the state and the actions of all agents. Next, the environment reacts to all agents' actions and transfers to the next state. The aim of agent $i$ is to learn a policy $\pi_i$ that maximizes its discounted cumulative reward.
The existing MADRL algorithms can be classified into the following groups.
1) Improved experience replay mechanism. The experience replay mechanism is a major breakthrough that enables the combination of deep learning and RL. It helps break the correlation among training data, which is a precondition for the convergence of the NN. However, the experience replay mechanism fails in the MADRL setting since it assumes the environment to be stationary, while the environment is non-stationary from any individual agent's point of view. Therefore, the data sampled from the replay buffer cannot represent the current dynamics of the environment. To this end, several works try to add information to the experience tuple to help the algorithm adapt to MADRL settings [
2) Centralized training and decentralized execution. A basic idea to guarantee a stationary environment in the MADRL setting is to allow each agent to know the policies of the other agents. Inspired by this, [

Fig. 6 Structure of MADDPG algorithms.
Each agent employs a centralized critic, which takes the global observation and the actions of other agents as inputs to guarantee the Markov property. Since the global information is only used by the critic during training, the actor can make decisions based on local information when implemented in practice. The centralized training and decentralized execution framework is an effective approach to overcome the non-stationarity issue in the MADRL setting when offline training can be implemented in a simulator. The attention mechanism can be further integrated with this framework to enhance the performance of the MADRL algorithm [
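A structural sketch of this centralized-critic, decentralized-actor design is given below: each actor maps only its local observation to an action, while each agent's critic is fed the observations and actions of all agents during training; the number of agents and layer sizes are illustrative assumptions. Because every critic conditions on the joint observations and actions, its learning target no longer drifts as the other agents' policies change, which restores stationarity during training while execution remains fully local.

```python
# Centralized-critic, decentralized-actor structure sketch (MADDPG-style); dimensions are placeholders.
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim = 3, 5, 2

# Decentralized actors: each maps its own local observation to an action
actors = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
          for _ in range(n_agents)]

# Centralized critics: each takes the observations and actions of ALL agents as input (training only)
joint_dim = n_agents * (obs_dim + act_dim)
critics = [nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(), nn.Linear(128, 1))
           for _ in range(n_agents)]

obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]     # local observations
acts = [actor(o) for actor, o in zip(actors, obs)]            # decentralized execution
joint_input = torch.cat(obs + acts, dim=1)                    # global information used during training
q_values = [critic(joint_input) for critic in critics]        # one centralized critic per agent
print([q.item() for q in q_values])
```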
3) Recurrent network-based approaches. Recurrent NNs (RNNs) enhance the memory capability of NNs. RNNs are used in single-agent DRL to address partially observable problems and long-term credit assignment issues. Recent studies also extend RNNs to the MADRL setting to solve the challenges of partially observable Markov games [
4) Parameter sharing. Parameter sharing is a frequent component in MADRL, which trains a single network whose parameters are shared among agents. Since the agents take different information as inputs, they can still make different decisions. This approach is proposed in [
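A minimal sketch of parameter sharing is shown below: a single policy network serves all agents, and an agent-index one-hot vector concatenated to the local observation lets the shared network still produce agent-specific decisions; the dimensions and greedy action selection are illustrative assumptions.

```python
# Parameter-sharing sketch: one shared policy network serves all agents; dimensions are placeholders.
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions = 4, 6, 3
shared_policy = nn.Sequential(nn.Linear(obs_dim + n_agents, 64), nn.ReLU(),
                              nn.Linear(64, n_actions))

def act(agent_id, local_obs):
    one_hot = torch.zeros(n_agents)
    one_hot[agent_id] = 1.0                                   # identifies which agent is acting
    logits = shared_policy(torch.cat([local_obs, one_hot]))
    return int(torch.argmax(logits))                          # greedy action for this agent

observations = [torch.randn(obs_dim) for _ in range(n_agents)]
print([act(i, o) for i, o in enumerate(observations)])
```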
Applications of RL algorithms in power and energy systems have been growing in recent years, including the optimization of smart power and energy distribution grids, flexible load demand, electricity markets, and operational control.
The voltage fluctuation and power quality issues caused by the increasing penetration of DERs and electric vehicles (EVs) in distribution networks bring great challenges to the operation of the distribution network. Traditional methods such as stochastic programming and robust optimization cannot effectively address highly uncertain environments. In addition, they rely heavily on accurate parameters of the distribution system, which are difficult to obtain in practice. As a data-driven approach, DRL can provide more flexible control decisions in real time according to the latest information.
Reference [
The RL algorithms have also been applied to the optimization of microgrids in [
An integrated energy system (IES) refers to the integrated system of energy production, supply, and marketing in the process of planning, construction, and operation. It is mainly composed of energy supply networks, such as power supply, gas supply, and cooling/heating networks; energy exchange units (combined cooling and heating power plants, generator sets, boilers, air conditioners, heating pumps, etc.); energy storage links (batteries, gas storage, heat and cold storage, etc.); terminal integrated energy supply units (microgrids); and a large number of customers, as shown in Fig. 7.

Fig. 7 Framework of IES.
Reference [
Employing RL-based approaches for the optimization of smart power and energy distribution grids provides the following advantages. Firstly, they can develop near-optimal control behaviors through continuous interaction with the environment. The learned strategy is scalable to new situations and can provide decisions in milliseconds without re-solving the problem. Therefore, they can provide more flexible control performance than pre-determined decisions when facing highly uncertain environments. Secondly, they are data-driven and reduce the dependence on an accurate system model.
The integration of renewable energy into the power system must be carefully managed to guarantee system security. At the same time, users' adjustable flexible loads increase significantly with the rapid development of residential smart power consumption. Demand-side management can improve the stability of the power grid by changing load consumption behavior via economic incentives and increasing the flexibility of demand. As a model-free algorithm, RL can deal with the uncertainty of the environment and extract human preferences by integrating the feedback reward signal into the control logic [
The first category of applying RL methods to demand-side management is the control of domestic hot water and heating, ventilation, and air-conditioning devices. The objectives can be the energy cost reduction of building energy systems [
EV charging is a challenging problem owing to the randomness in the commuting behaviors of EV owners, traffic conditions, and the fluctuations of electricity prices. Traditional methods rely on forecasting information, and it is difficult to obtain the distributions of random variables in practice. As a model-free, data-driven approach, RL can learn the transition probability and develop an optimal control strategy without the requirement of mathematical models.
In [
With the increasing penetration of DERs and flexible demands, the electricity market is facing more uncertainties and complexities from both the generation and demand sides. This motivates the generation companies to design more sophisticated bidding strategies to reduce revenue loss when participating in the liberalized electricity market. Reference [
In order to ensure the safe and stable operation of the system, different stability controllers have been developed. Typically, the parameters of controllers are tuned based on the linearized model of the system under a certain operation condition. However, the integration of more power electronics-interfaced DERs and loads makes it even more challenging.
To solve this problem, adaptive control is used for the self-tuning of the controller parameter settings. In [
As mentioned above, the Q-learning algorithm is only suitable for the discretized action domain. To address that, in [
The application summary of single-agent RL in power and energy systems is shown in
Single-agent RL algorithms rely heavily on the centralized framework, which requires complete communication links and costly communication devices. With the increasing penetration of DERs and flexible loads, modern power and energy systems are becoming more complex and larger with more operation conditions and control options, which makes it difficult for these methods to scale up. These issues can be effectively solved by the MADRL framework, as shown in
Two main categories are identified and reviewed here according to the implementation types.
The first category is the independent learner-based approach, which directly applies the single-agent algorithm into the multi-agent setting. Reference [
Reference [
Learning in the multi-agent setting is much more complex than in the single-agent case, as each agent needs to learn the dynamics of the environment as well as the policies of the other agents. For each agent, the environment is non-stationary since the policies of other agents change continuously during training, leading to the violation of the Markov property. Although this category of methods violates the basic assumption of RL and lacks convergence guarantees, it has actually been used in some scenarios in practice, and simulation results demonstrate that good performance and better scalability can be achieved in certain circumstances.
Centralized training and decentralized execution is the general MADRL framework, which employs centralized critics that utilize global information during training to guarantee the Markov property. Reference [
The increasing complexity and uncertainty in modern power and energy systems, as well as the wide-area deployment of advanced sensors, make ML-based approaches a promising alternative for power system operation and control. This paper conducts a comprehensive review of RL algorithms and their applications in power and energy systems. A review of widely accepted algorithms in RL, DRL, and MADRL is first provided. Then, the applications of RL algorithms in power and energy systems are investigated in detail, including the optimization of distribution networks and microgrids, energy management, electricity markets, demand response, and operation control. Several applications of MADRL are presented as well.
Although numerous applications of RL in modern power and energy systems have been studied, there are still many interesting problems worth further study, including but not limited to the following.
1) Since power and energy systems have high safety requirements, the physical constraints should be handled explicitly when building the RL model instead of being added directly to the reward function as soft constraints. Safe RL is a suitable way to deal with the optimization and control problems by solving a constrained MDP. Other ways of embedding physical knowledge in the RL model may also improve reliability and motivate real-world implementation.
2) Offline training relies on an accurate physical model, while online training may affect the operation of power and energy systems. Batch RL and surrogate models are two ways to reduce the dependence on the physical model without impacting the operation of the system. However, they require a certain amount of historical data. Transfer learning may be another promising alternative, training the RL model offline and transferring the learned control strategy to real-world environments with only a few recorded samples and iterations.
3) Modern power and energy systems are becoming more complex and larger with more operation conditions and control options. Single-agent RL algorithms adopt a centralized framework that relies heavily on complete communication links, and thus are incapable of dealing with communication delays and scaling up to large systems. MADRL can partially mitigate this issue by adopting a centralized training, decentralized execution framework. However, existing MADRL algorithms face great challenges when dealing with very large systems that require large agent populations. Further research may apply advanced MADRL algorithms with novel population scaling mechanisms to enable the RL method to scale up to very large systems.
4) Many control and optimization problems in power and energy systems have typical hierarchical structures, as does the decision-making process of human beings. A hierarchical framework can reduce the deployment cost of the complete communication devices required by centralized control and avoid the isolation issue of local control, and thus it is another promising way for the control of large systems. Owing to the complexity of hierarchical structures and the lack of a general hierarchical framework, applications of RL for hierarchical control are rare in power and energy systems. Future research may apply RL-based hierarchical control frameworks to large systems.
5) With the increased integration of power electronic devices, DERs, and flexible loads, the complexities and uncertainties are growing in modern power and energy systems. The classical offline training and online execution manner of RL is incapable of dealing with continuously generated unmodeled dynamics. Meta-learning and continual learning can be integrated with the RL algorithm to achieve life-long learning ability. This helps the continuous transformation of online data into powerful knowledge, which can successively enhance the control behavior of the RL agent. Therefore, the robustness and adaptability to unmodeled system dynamics can be enhanced, and the training time can be shortened in complex scenarios.
References
R. Detchon and R. Van Leeuwen, “Policy: bring sustainable energy to the developing world,” Nature, vol. 508, no. 7496, pp. 309-311, Apr. 2014.
B. Kroposki, “Integrating high levels of variable renewable energy into electric power systems,” Journal of Modern Power Systems and Clean Energy, vol. 5, no. 6, pp. 831-837, Nov. 2017.
J. Zhu, E. Zhuang, J. Fu et al., “A framework-based approach to utility big data analytics,” IEEE Transactions on Power Systems, vol. 31, no. 3, pp. 1-8, Aug. 2015.
D. Cao, J. Li, D. Cai et al., “Design and application of big data platform architecture for typical scenarios of power system,” in Proceedings of 2018 IEEE PES General Meeting (PESGM), Portland, USA, Aug. 2018, pp. 1-5.
L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: a survey,” Journal of Artificial Intelligence Research, vol. 4, no. 1, pp. 237-285, Apr. 1996.
Y. Xu, Z. Dong, R. Zhang et al., “Multi-timescale coordinated voltage/var control of high renewable-penetrated distribution systems,” IEEE Transactions on Power Systems, vol. 32, no. 6, pp. 4398-4408, Feb. 2017.
P. Li, C. Zhang, Z. Wu et al., “Distributed adaptive robust voltage/var control with network partition in active distribution networks,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2245-2256, Oct. 2019.
B. Zhao, Z. Xu, C. Xu et al., “Network partition-based zonal voltage control for distribution networks with distributed PV systems,” IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4087-4098, Jan. 2017.
F. L. Pagola, I. J. Perez-Arriaga, and G. C. Verghese, “On sensitivities, residues and participations: applications to oscillatory stability analysis and control,” IEEE Transactions on Power Systems, vol. 4, no. 1, pp. 278-285, Mar. 1989.
C. Chung, L. Wang, F. Howell et al., “Generation rescheduling methods to improve power transfer capability constrained by small-signal stability,” IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 524-530, Mar. 2004.
Z. Bouchama and M. Harmas, “Optimal robust adaptive fuzzy synergetic power system stabilizer design,” Electric Power Systems Research, vol. 83, no. 1, pp. 170-175, Feb. 2012.
S. Das and I. Pan, “On the mixed H∞/H2 loop-shaping tradeoffs in fractional-order control of the AVR system,” IEEE Transactions on Industrial Informatics, vol. 10, no. 4, pp. 1982-1991, Nov. 2013.
D. Ke, F. Shen, C. Chung et al., “Application of information gap decision theory to the design of robust wide-area power system stabilizers considering uncertainties of wind power,” IEEE Transactions on Sustainable Energy, vol. 9, no. 2, pp. 805-817, Apr. 2018.
P. Zhao, W. Yao, S. Wang et al., “Decentralized nonlinear synergetic power system stabilizers design for power system stability enhancement,” International Transaction on Electrical Energy Systems, vol. 24, no. 9, pp. 1356-1368, Sept. 2014.
M. J. Morshed and A. Fekih, “A probabilistic robust coordinated approach to stabilize power oscillations in DFIG-based power systems,” IEEE Transactions on Industrial Informatics, vol. 15, no. 10, pp. 5599-5612, Oct. 2019.
Z. Ni, Y. Tang, X. Sui et al., “An adaptive neuro-control approach for multi-machine power systems,” International Journal of Electrical Power & Energy Systems, vol. 75, pp. 108-116, Feb. 2016.
D. Cao, J. Zhao, W. Hu et al. (2020, Jun.). Model-free voltage regulation of unbalanced distribution network based on surrogate model and deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/2006.13992.
Y. Gao, W. Wang, J. Shi et al., “Batch-constrained reinforcement learning for dynamic distribution network reconfiguration,” IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5357-5369, Nov. 2020.
Y. Gao, R. Zhou, H. Wang et al., “Study on an average reward reinforcement learning algorithm,” Chinese Journal of Computers, vol. 30, no. 8, pp. 1372-1378, Aug. 2007.
E. Ipek, O. Mutlu, J. F. Mart et al., “Self-optimizing memory controllers: A reinforcement learning approach,” ACM SIGARCH Computer Architecture News, vol. 36, no. 3, pp. 39-50, Jul. 2008.
J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: a survey,” International Journal of Robotics Research, vol. 32, no. 11, pp. 1238-1274, Jan. 2013.
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge: MIT Press, 2016.
R. S. Sutton and A. G. Barto, Reinforcement Learning: an Introduction. Cambridge: MIT Press, 1998.
P. Hernandezleal, B. Kartal, and M. E. Taylor, “A survey and critique of multiagent deep reinforcement learning,” Autonomous Agents and Multi-agent Systems, vol. 33, no. 6, pp. 750-797, Oct. 2019.
C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
V. Mnih, K. Kavukcuoglu, D. Silver et al. (2013, Dec.). Playing Atari with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1312.5602
H. Van Hasselt, A. Guez, and D. Silver. (2015, Sept.). Deep reinforcement learning with double Q-learning. [Online]. Available: https://arxiv.org/abs/1509.06461v1
Z. Wang, T. Schaul, M. Hessel et al., “Dueling network architectures for deep reinforcement learning,” in Proceedings of International Conference on Machine Learning, Lille, France, Jul. 2015, pp. 1995-2003.
T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., “Continuous control with deep reinforcement learning,” in Proceedings of International Conference on Learning Representation (ICLR), San Diego, USA, May 2015, pp. 1-14.
D. Silver, G. Lever, N. Heess et al., “Deterministic policy gradient algorithms,” in Proceedings of International Conference on Machine Learning, Beijing, China, Jun. 2014, pp. 387-395.
V. Mnih, A. P. Badia, M. Mirza et al. (2016, Jun.). Asynchronous methods for deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1602.01783
T. Haarnoja, A. Zhou, P. Abbeel et al. (2018, Jan.). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. [Online]. Available: https://arxiv.org/abs/1801.01290
J. Schulman, S. Levine, P. Abbeel et al., “Trust region policy optimization,” in Proceedings of International Conference on Machine Learning, Lille, France, Jul. 2015, pp. 1889-1897.
J. Schulman, F. Wolski, P. Dhariwal et al. (2017, Jul.). Proximal policy optimization algorithms. [Online]. Available: https://arxiv.org/abs/1707.06347
S. Omidshafiei, J. Pazis, C. Amato et al., “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, Aug. 2017, pp. 2681-2690.
J. N. Foerster, N. Nardelli, G. Farquhar et al., “Stabilizing experience replay for deep multi-agent reinforcement learning,” in Proceedings of International Conference on Machine Learning, Sydney, Australia, Aug. 2017, pp. 1-10.
G. Palmer, K. Tuyls, D. Bloembergen et al. (2017, Jul.). Lenient multi-agent deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1707.04402
R. Lowe, Y. Wu, A. Tamar et al., “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proceedings of Advances in Neural Information Processing Systems, Long Beach, USA, Jun. 2017, pp. 6379-6390.
S. Iqbal and F. Sha, “Actor-attention-critic for multi-agent reinforcement learning,” in Proceedings of International Conference on Machine Learning, Stockholm, Sweden, Oct. 2018, pp. 2961-2970.
Z. Hong, S. Su, T. Shan et al., “A deep policy inference Q-network for multi-agent systems,” in Proceedings of International Conference on Autonomous Agents and Multiagent Systems, Sao Paulo, Brazil, Nov. 2017, pp. 1388-1396.
M. Jaderberg, W. M. Czarnecki, D. Iain et al., “Human-level performance in 3D multiplayer games with population-based reinforcement learning,” Science, vol. 364, no. 6443, pp. 859-865, May 2019.
J. N. Foerster, Y. M. Assael, N. De Freitas et al., “Learning to communicate with deep multi-agent reinforcement learning,” in Proceedings of Advances in Neural Information Processing Systems, Barcelona, Spain, May 2016, pp. 2145-2153.
J. K. Gupta, M. Egorov, and M. J. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in Proceedings of International Conference on Autonomous Agents and Multiagent Systems, Sao Paulo, Brazil, Nov. 2017, pp. 66-83.
M. Al-Saffar and P. Musilek, “Reinforcement learning based distributed BESS management for mitigating overvoltage issues in systems with high PV penetration,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 2980-2994, Feb. 2020.
Q. Yang, G. Wang, A. Sadeghi et al., “Two-timescale voltage control in distribution grids using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2313-2323, Nov. 2019.
W. Wang, N. Yu, Y. Gao et al., “Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3008-3018, Dec. 2019.
Z. Yan and Y. Xu, “Real-time optimal power flow: a Lagrangian based deep reinforcement learning approach,” IEEE Transactions on Power Systems, vol. 35, no. 4, pp. 3270-3273, Apr. 2020.
Y. Gao, W. Wang, J. Shi et al., “Batch-constrained reinforcement learning for dynamic distribution network reconfiguration,” IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5357-5369, Nov. 2020.
H. Xu, A. D. Domínguez-García, and P. W. Sauer, “Optimal tap setting of voltage regulation transformers using batch reinforcement learning,” IEEE Transactions on Power Systems, vol. 35, no. 3, pp. 1990-2001, Oct. 2019.
V. François-Lavet, D. Taralla, and D. Ernst, “Deep reinforcement learning solutions for energy microgrids management,” in Proceedings of European Workshop on Reinforcement Learning (EWRL), Barcelona, Spain, Nov. 2016, pp. 1-7.
E. Kuznetsova, Y. Li, C. Ruiz, et al., “Reinforcement learning for microgrid energy management,” Energy, vol. 59, pp. 133-146, May 2013.
X. Yang, Y. Wang, H. He et al., “Deep reinforcement learning for economic energy scheduling in data center microgrids,” in Proceedings of 2019 IEEE PES General Meeting (PESGM), Atlanta, USA, Aug. 2019, pp. 1-5.
V. H. Bui, A. Hussain, and H. M. Kim, “Double deep Q-learning-based distributed operation of battery energy storage system considering uncertainties,” IEEE Transactions on Smart Grid, vol. 11, no. 1, pp. 457-469, Jun. 2019.
Q. Zhang, K. Dehghanpour, Z. Wang et al., “A learning-based power management method for networked microgrids under incomplete information,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1193-1204, Aug. 2020.
Y. Du and F. Li, “Intelligent multi-microgrid energy management based on deep neural network and model-free reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1066-1076, Jul. 2019.
Q. Sun, D. Wang, D. Ma et al., “Multi-objective energy management for we-energy in Energy Internet using reinforcement learning,” in Proceedings of 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, USA, Nov. 2017, pp. 1-6.
B. Zhang, W. Hu, D. Cao et al., “Deep reinforcement learning-based approach for optimizing energy conversion in integrated electrical and heating system with renewable energy,” Energy Conversion and Management, vol. 202, p. 112199, Dec. 2019.
Y. Ye, D. Qiu, X. Wu et al., “Model-free real-time autonomous control for a residential multi-energy system using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3068-3082, Feb. 2020.
B. Zhang, W. Hu, J. Li et al., “Dynamic energy conversion and management strategy for an integrated electricity and natural gas system with renewable energy: deep reinforcement learning approach,” Energy Conversion and Management, vol. 220, p. 113063, Sept. 2020.
A. Sheikhi, M. Rayati, and A. M. Ranjbar, “Demand side management for a residential customer in multi-energy systems,” Sustainable Cities and Society, vol. 22, pp. 63-77, Jan. 2016.
J. R. Vázquez-Canteli and Z. Nagy, “Reinforcement learning for demand response: a review of algorithms and modeling techniques,” Applied Energy, vol. 235, pp. 1072-1089, Feb. 2019.
G. P. Henze and S. Liu, “Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory–part 1: theoretical foundation.” Energy and Buildings, vol. 38, no. 2, pp. 142-147, Feb. 2006.
G. P. Henze and S. Liu, “Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory–part 2: results and analysis,” Energy and Buildings, vol. 38, no. 2, pp. 148-161, Feb. 2006.
G. P. Henze and S. Liu, “Evaluation of reinforcement learning for optimal control of building active and passive thermal storage inventory,” Journal of Solar Energy Engineering, vol. 129, no. 2, pp. 215-225, May 2007.
J. Vázquez-Canteli, J. Kampf, and Z. Nagy, “Balancing comfort and energy consumption of a heat pump using batch reinforcement learning with fitted Q-iteration,” Energy Procedia, vol. 122, pp. 415-420, Sept. 2017.
F. Ruelens, B. J. Claessens, S. Vandael et al., “Residential demand response of thermostatically controlled loads using batch reinforcement learning,” IEEE Transactions on Smart Grid, vol. 8, no. 5, pp. 2149-2159, Feb. 2016.
H. Kazmi, F. Mehmood, S. Lodeweyckx et al., “Gigawatt-hour scale savings on a budget of zero: deep reinforcement learning based optimal control of hot water systems,” Energy, vol. 144, pp. 159-168, Dec. 2017.
Y. Liang, L. He, X. Cao et al., “Stochastic control for smart grid users with flexible demand,” IEEE Transaction on Smart Grid, vol. 4, no. 4, pp. 2296-2308, Dec. 2013.
Y. Liu, C. Yuen, N. U. Hassan et al., “Electricity cost minimization for a microgrid with distributed energy resource under different information availability,” IEEE Transaction on Industrial Electronics, vol. 62, no. 4, pp. 2571-2583, Jan. 2014.
H. Li, Z. Wan, and H. He, “Real-time residential demand response,” IEEE Transaction on Smart Grid, vol. 11, no. 5, pp. 4144-4154, Mar. 2020.
E. Mocanu, D. C. Mocanu, P. H. Nguyen et al., “On-line building energy optimization using deep reinforcement learning,” IEEE Transaction on Smart Grid, vol. 10, no. 4, pp. 3698-3708, Jul. 2017.
B. V. Mbuwir, F. Spiessens, and G. Deconinck, “Self-learning agent for battery energy management in a residential microgrid,” in Proceedings of 2018 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), Sarajevo, Bosnia, Oct. 2018, pp. 1-6.
A. Chis, J. Lunden, and V. Koivunen, “Reinforcement learning-based plug-in electric vehicle charging with forecasted price,” IEEE Transaction on Vehicular Technology, vol. 66, no. 5, pp. 3674-3684, Jan. 2016.
J. Wu, H. He, J. Peng et al., “Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus,” Applied Energy, vol. 222, pp. 799-811, Jul. 2018.
T. Liu, Y. Zou, D. Liu et al., “Reinforcement learning of adaptive energy management with transition probability for a hybrid electric tracked vehicle,” IEEE Transaction on Industrial Electronics, vol. 62, no. 12, pp. 7837-7846, Dec. 2015.
Y. Hu, W. Li, K. Xu et al., “Energy management strategy for a hybrid electric vehicle based on deep reinforcement learning,” Applied Sciences, vol. 8, no. 2, pp. 187-202, Jan. 2018.
X. Qi, Y. Luo, G. Wu et al., “Deep reinforcement learning-based vehicle energy efficiency autonomous learning system,” in Proceedings of 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, USA, Jun. 2017, pp. 1228-1233.
X. Qi, Y. Luo, G. Wu et al., “Deep reinforcement learning enabled self-learning control for energy efficient driving,” Transportation Research Part C: Emerging Technologies, vol. 99, pp. 67-81, Feb. 2019.
Y. Wu, H. Tan, J. Peng et al., “Deep reinforcement learning of energy management with continuous control strategy and traffic information for a series-parallel plug-in hybrid electric bus,” Applied Energy, vol. 247, pp. 454-466, Aug. 2019.
Z. Wan, H. Li, H. He et al., “A data-driven approach for real-time residential EV charging management,” in Proceedings of 2018 IEEE Power & Energy Society General Meeting (PESGM), Portland, USA, Aug. 2018, pp. 1-5.
Z. Wan, H. Li, H. He et al., “Model-free real-time EV charging scheduling based on deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 10, no. 5, pp. 5246-5257, Nov. 2018.
H. Li, Z. Wan, and H. He, “Constrained EV charging scheduling based on safe deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2427-2439, Nov. 2019.
M. Rahimiyan and H. R. Mashhadi, “Supplier’s optimal bidding strategy in electricity pay-as-bid auction: comparison of the Q-learning and a model-based approach,” Electric Power Systems Research, vol. 78, no. 1, pp. 165-175, Jan. 2008.
M. B. Naghibi-Sistani, M. Akbarzadeh-Tootoonchi, M. J. D. Bayaz et al., “Application of Q-learning with temperature variation for bidding strategies in market based power systems,” Energy Conversion and Management, vol. 47, no. 11, pp. 1529-1538, Jan. 2006.
G. Xiong, T. Hashiyama, and S. Okuma, “An electricity supplier bidding strategy through Q-learning,” in Proceedings of IEEE Power Engineering Society Summer Meeting, Chicago, USA, vol. 3, Aug. 2002, pp. 1516-1521.
H. Song, C. Liu, J. Lawarree et al., “Optimal electricity supply bidding by Markov decision process,” IEEE Transactions on Power Systems, vol. 15, no. 2, pp. 618-624, Jun. 2000.
V. Nanduri and T. K. Das, “A reinforcement learning model to assess market power under auction-based energy pricing,” IEEE Transactions on Power Systems, vol. 22, no. 1, pp. 85-95, Mar. 2007.
A. C. Tellidou and A. G. Bakirtzis, “Agent-based analysis of capacity with holding and tacit collusion in electricity markets,” IEEE Transactions on Power Systems, vol. 22, no. 4, pp. 1735-1742, Dec. 2007.
X. Xu, Y. Xu, J. Li et al., “Data-driven game-based pricing for sharing rooftop photovoltaic generation and energy storage in the residential building cluster under uncertainties,” IEEE Transactions on Industrial Informatics. doi: 10.1109/TII.2020.3016336
G. Li and J. Shi, “Agent-based modeling for trading wind power with uncertainty in the day-ahead wholesale electricity markets of single-sided auctions,” Applied Energy, vol. 99, pp. 13-22, Nov. 2012.
Y. Ye, D. Qiu, M. Sun et al., “Deep reinforcement learning for strategic bidding in electricity markets,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1343-1355, Aug. 2019.
D. Cao, W. Hu, X. Xu et al., “Bidding strategy for trading wind energy and purchasing reserve of wind power producer–a DRL based approach,” International Journal of Electrical Power & Energy Systems, vol. 117, p. 105648, May 2020.
H. Xu, H. Sun, D. Nikovski et al., “Deep reinforcement learning for joint bidding and pricing of load serving entity,” IEEE Transactions on Smart Grid, vol. 10, no. 6, pp. 6366-6375, Mar. 2019.
T. Yu, B. Zhou, K. W. Chan et al., “Stochastic optimal relaxed automatic generation control in non-Markov environment based on multi-step Q(λ) learning,” IEEE Transactions on Power Systems, vol. 26, no. 3, pp. 1272-1282, Aug. 2011.
R. Hadidi and B. Jeyasurya, “Reinforcement learning based real-time wide-area stabilizing control agents to enhance power system stability,” IEEE Transactions on Smart Grid, vol. 4, no. 1, pp. 489-497, Mar. 2013.
R. Diao, Z. Wang, D. Shi et al., “Autonomous voltage control for grid operation using deep reinforcement learning,” in Proceedings of IEEE PES General Meeting, Atlanta, USA, Aug. 2019, pp. 1-5.
Q. Huang, R. Huang, W. Hao et al., “Adaptive power system emergency control using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1171-1182, Aug. 2019.
T. Lan, J. Duan, B. Zhang et al. (2019, Nov.). AI-based autonomous line flow control via topology adjustment for maximizing time-series ATCs. [Online]. Available: https://arxiv.org/abs/1911.04263
Z. Yan and Y. Xu, “Data-driven load frequency control for stochastic power systems: a deep reinforcement learning method with continuous action search,” IEEE Transactions on Power Systems, vol. 34, no. 2, pp. 1653-1656, Nov. 2018.
J. Duan, D. Shi, R. Diao et al., “Deep-reinforcement-learning-based autonomous voltage control for power grid operations,” IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 814-817, Sept. 2019.
Y. Hashmy, Z. Yu, D. Shi et al., “Wide-area measurement system-based low frequency oscillation damping control through reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5072-5083, Nov. 2020.
G. Zhang, W. Hu, D. Cao et al., “A data-driven approach for designing statcom additional damping controller for wind farms,” International Journal of Electrical Power & Energy Systems, vol. 117, p. 105620, May 2020.
G. Zhang, W. Hu, D. Cao et al., “Deep reinforcement learning-based approach for proportional resonance power system stabilizer to prevent ultra-low-frequency oscillations,” IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5260-5272, Nov. 2020.
J. Duan, Z. Yi, D. Shi et al., “Reinforcement-learning-based optimal control of hybrid energy storage systems in hybrid AC-DC micro-grids,” IEEE Transactions on Industrial Informatics, vol. 15, no. 9, pp. 5355-5364, Jan. 2019.
R. Lu, S. Hong, and M. Yu, “Demand response for home energy management using reinforcement learning and artificial neural network,” IEEE Transaction on Smart Grid, vol. 10, no. 6, pp. 6629-6639, Apr. 2019.
Y. Zhang, X. Wang, J. Wang et al., “Deep reinforcement learning based volt-var optimization in smart distribution systems,” IEEE Transactions on Smart Grid. doi: 10.1109/TSG.2020.3010130
X. Xu, Y. Jia, Y. Xu et al., “A multi-agent reinforcement learning-based data-driven method for home energy management,” IEEE Transaction on Smart Grid, vol. 11, no. 4, pp. 3201-3211, Jul. 2020.
C. Chen, M. Cui, F. Li et al., “Model-free emergency frequency control based on reinforcement learning,” IEEE Transactions on Industrial Informatics. doi: 10.1109/TII.2020.3001095
S. Wang, J. Duan, D. Shi et al., “A data-driven multi-agent autonomous voltage control framework using deep reinforcement learning,” IEEE Transactions on Power Systems, vol. 35, no. 6, pp. 4644-4654, Nov. 2020.
Z. Yan and Y. Xu, “A multi-agent deep reinforcement learning method for cooperative load frequency control of multi-area power systems,” IEEE Transactions on Power Systems, vol. 35, no. 6, pp. 4599-4608, Nov. 2020.
D. Cao, W. Hu, J. Zhao et al., “A multi-agent deep reinforcement learning based voltage regulation using coordinated PV inverters,” IEEE Transactions on Power Systems, vol. 35, no. 5, pp. 4120-4123, Sept. 2020.
L. Yu, Y. Sun, Z. Xu et al., “Multi-agent deep reinforcement learning for HVAC control in commercial buildings,” IEEE Transactions on Smart Grid. doi: 10.1109/TSG.2020.3011739
D. Cao, J. Zhao, W. Hu et al. (2020, May). Distributed voltage regulation of active distribution system based on enhanced multi-agent deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/2006.00546