Abstract
This paper proposes a novel parallel hybrid deep reinforcement learning (DRL) approach to address the real-time energy management problem of a microgrid (MG). Because the proposed approach can directly approximate a discrete-continuous hybrid policy, it does not require the discretization of continuous actions that regular DRL approaches rely on, thereby avoiding accuracy degradation and the curse of dimensionality. In addition, a novel experience-sharing-based parallel technique is developed for the proposed approach to accelerate training and enhance training robustness. Finally, a safety projection technique is introduced and incorporated into the proposed approach to improve decision feasibility. Comparative numerical simulations with several existing MG real-time energy management approaches (i.e., the myopic policy, model predictive control, and regular DRL approaches) demonstrate the effectiveness and superiority of the proposed approach.
Due to the detrimental effects of fossil fuels on the environment and the decreasing costs of renewable energy sources (RESs), RES deployment has witnessed a significant upswing worldwide [
As a classic approach, the myopic policy [
In recent years, the Markov decision process (MDP)-based approaches have emerged as superior and promising alternative solutions to the REM problem of MG [
To address the inherent limitations of model-based ADP approaches, a growing trend toward the application of model-free DRL approaches in the REM of MG has emerged [
1) Value-based approaches. These approaches learn the state or state-action values and choose the action with the highest value in the state. In [
2) Policy-based approaches. These approaches directly learn the policy function that maps states to actions, allowing them to adapt to continuous action spaces through either a deterministic or a stochastic policy form. As a representative deterministic policy algorithm, a deep deterministic policy gradient (DDPG) was utilized in [
Although existing research has encouraged the application of DRL techniques in the MG REM, these approaches have the following limitations. ① Existing DRL approaches are limited to handling either discrete or continuous actions. This necessitates the discretization of continuous actions when confronted with the problems involving a hybrid action space [
To address these limitations, this paper applies a novel parallel hybrid PPO (PH-PPO) algorithm in the MG REM problem with a hybrid action space. The main contributions of this paper are summarized as follows.
1) A novel hybrid actor-critic (H-AC) architecture is developed using the PH-PPO algorithm. Unlike existing DRL approaches that require the discretization of continuous actions when confronted with a discrete-continuous hybrid action space, the proposed approach adopts the H-AC architecture to deal directly and simultaneously with discrete and continuous actions, leading to faster convergence toward a superior solution.
2) An experience-sharing-based parallel technique is developed for the PH-PPO algorithm, which allows multiple agents to explore different environments simultaneously and share their collected experiences. The experience-sharing-based parallel technique fully utilizes the computational resources of multicore central processing unit (CPU) and graphics processing unit (GPU), resulting in accelerated training speed as well as improved training robustness.
3) A safety projection technique is introduced and incorporated into the PH-PPO algorithm, which utilizes the prior-domain knowledge of the MG REM to restrict the output actions within a feasible range, and greatly enhances the decision feasibility.
The remainder of this paper is organized as follows. Section II introduces the mathematical formulation of the MG REM problem. Section III reformulates the MDP. Section IV presents the PH-PPO algorithm in detail. Section V describes case studies. Finally, Section VI concludes this paper.
We first formulate a mathematical model of the MG REM problem as a mixed-integer nonlinear programming (MINLP) problem. A representative MG configuration is considered, comprising DGs such as micro-gas turbines (MTs) and diesel generators (DEs), non-dispatchable generators (NGs) such as wind turbines (WTs) and photovoltaic (PV) panels, ESSs, electrical loads, and an energy management system (EMS). The MG is interconnected with the utility grid and engages in bidirectional power exchange with it.
The objective of the MG REM problem is to minimize the total operational cost of the MG by efficiently coordinating diverse energy resources and demands within the system while considering the dynamic nature of RESs and load demands. Mathematically, the objective can be expressed as:
(1)
(2)
(3)
(4)
(5)
where is the decision variable; is the scheduling period; t is the index of time; is the set of DGs; is the set of ESSs; is the time interval; is the fuel cost of DGs and is formulated as a quadratic function of the active output power of dispatchable units , as shown in (2); , , and are the fuel cost coefficients; is the start-up cost of DGs and can be calculated by (3); is the on/off status of DGs (1 for operation and 0 for shutdown); is the start-up cost of generator ; is the power exchange cost with the utility grid, which settles the trading power by real-time price , as shown in (4); represents both the electricity purchasing price and feed-in tariff of the MG and is similar to those in [
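For illustration, the cost terms described above can be sketched in a few lines of Python: the quadratic fuel cost, the start-up cost incurred on off-to-on transitions, and the real-time-price settlement of the exchanged power follow (2)-(4). The coefficient names and any remaining term of (1) are illustrative assumptions rather than the paper's exact notation.

```python
# A minimal sketch of the per-step operational cost terms in (2)-(4), assuming
# the quadratic fuel-cost form and start-up costs on off-to-on transitions
# described above; coefficient names (a, b, c, c_su) are illustrative, and any
# remaining term of (1) is omitted.
def step_operational_cost(p_dg, u_dg, u_dg_prev, p_grid, price,
                          a, b, c, c_su, dt=1.0):
    # Quadratic fuel cost of the dispatchable generators, cf. (2)
    fuel = sum(u_dg[i] * (a[i] * p_dg[i] ** 2 + b[i] * p_dg[i] + c[i])
               for i in range(len(p_dg))) * dt
    # Start-up cost incurred when a unit switches from off to on, cf. (3)
    startup = sum(c_su[i] * max(u_dg[i] - u_dg_prev[i], 0)
                  for i in range(len(u_dg)))
    # Exchange cost settled at the real-time price (negative when selling), cf. (4)
    exchange = price * p_grid * dt
    return fuel + startup + exchange
```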
The MG system is governed by the following constraints.
(6)
where and are the upper and lower boundaries of the active power generated by the DGs, respectively.
(7)
where and are the maximum upward and downward ramping rates of the DGs, respectively.
(8)
where and are the on and off time counters of the unit until time , respectively; and and are the minimum on and off time, respectively.
(9)
where and are the minimum and maximum power exchanges between the MG and utility grid, respectively.
(10)
(11)
where and are the voltage magnitude and phase angle of bus i, respectively; and are the minimum and maximum allowable voltage magnitudes, respectively; and is the set of buses.
(12)
where is the set of injected elements including DGs, NGs, ESSs, and power exchanges; is the element in the generator-bus incidence matrix (equal to 1 when generator is connected to bus ); and are the active and reactive loads at bus , respectively; and are the active and reactive output power of the injected element , respectively; and are the real and imaginary parts of row and column of the bus admittance matrix, respectively; and is the phase angle difference between buses and .
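For reference, the nodal power-balance constraint in (12) follows the standard AC injection form; the sketch below uses illustrative symbols consistent with the quantities defined above and may differ from the paper's exact notation.

```latex
% Sketch of the standard nodal AC power balance that (12) follows, with
% illustrative symbols: a_{k,i} (generator-bus incidence), P_{k,t}/Q_{k,t}
% (injections), P^{L}_{i,t}/Q^{L}_{i,t} (loads), G_{ij}/B_{ij} (bus admittance
% entries), and \theta_{ij,t} (phase-angle difference).
\sum_{k \in \Omega} a_{k,i} P_{k,t} - P^{L}_{i,t}
  = V_{i,t} \sum_{j \in \mathcal{B}} V_{j,t}
    \left( G_{ij} \cos\theta_{ij,t} + B_{ij} \sin\theta_{ij,t} \right)
\qquad
\sum_{k \in \Omega} a_{k,i} Q_{k,t} - Q^{L}_{i,t}
  = V_{i,t} \sum_{j \in \mathcal{B}} V_{j,t}
    \left( G_{ij} \sin\theta_{ij,t} - B_{ij} \cos\theta_{ij,t} \right)
```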
(13)
(14)
where and are the conductance and susceptance of the line between buses and , respectively; and and are the upper and lower limits of the line transmission power between buses and , respectively.
Two binary variables, and , are employed to represent the charging and discharging states of the ESS, respectively. and indicate the charging mode, whereas and indicate the discharging mode. Let us denote the maximum allowed charging and discharging power as and , respectively. We then have:
(15)
(16)
(17)
where and are the charging and discharging power of ESSs, respectively. Let us denote the energy amount currently stored in ESSs as . The dynamics of are described as:
(18)
(19)
where and are the charging and discharging efficiencies, respectively; and and are the minimum and maximum energy limits, respectively. Ultimately, the REM problem of MG is mathematically formulated as an MINLP problem, where the objective function is expressed as (1), the constraints are expressed in (6)-(19), and the decision variables are defined by:
(20)
It can be observed that this problem is a highly nonconvex nonlinear problem with mixed decision variables. Addressing this problem on a real-time scale can be extremely challenging, particularly when accounting for uncertainties. A DRL approach is next proposed to address this problem.
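Before moving to the MDP reformulation, the ESS constraints and dynamics in (15)-(19) can be made concrete with a minimal Python sketch that checks the power limits, the charging/discharging exclusivity, and the efficiency-weighted energy update. The parameter values and function name are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the ESS constraints and dynamics in (15)-(19), assuming a
# 1-hour step and illustrative parameter values.
def ess_step(energy, p_charge, p_discharge, u_ch, u_dis,
             p_ch_max=400.0, p_dis_max=400.0,
             eta_ch=0.9, eta_dis=0.9,
             e_min=400.0, e_max=1800.0, dt=1.0):
    """Return next stored energy; raise if the action is infeasible."""
    # Charging/discharging power limited by the rating and by the binary state
    assert 0.0 <= p_charge <= u_ch * p_ch_max
    assert 0.0 <= p_discharge <= u_dis * p_dis_max
    # Charging and discharging are mutually exclusive
    assert u_ch + u_dis <= 1
    # Energy dynamics with charging/discharging efficiencies, cf. (18)
    next_energy = energy + eta_ch * p_charge * dt - p_discharge * dt / eta_dis
    # Stored energy must stay within its limits, cf. (19)
    assert e_min <= next_energy <= e_max
    return next_energy
```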
We next map the mathematical model of the MG REM problem to an MDP, which is the mathematical foundation and modeling tool for DRL. The purpose of the MDP is to provide a framework in which the agent finds a policy that maximizes its total accumulated reward. To achieve this, we describe the components of the MDP to ensure that its outcome also corresponds to the solution to the MG REM problem given in (1)-(19).
An MDP problem consists of a quintuple , where and are the state space and action space, respectively; is the state transition function; is the reward function; and is the discount factor. In each step of an MDP, the agent observes a state from the environment. Based on , the agent selects and executes an action . Then, the environment transitions to the next state according to the state transition function . The environment then returns a reward to the agent. This process continues through subsequent time steps until the required state or a predetermined termination condition is reached. These elements are defined as follows.
1) State. The following critical variables are used to form the state space:
(21)
where , , , , , , , , , and are the vectors consisting of , , , , , , , , , and , respectively, and and are the output power of NGs.
2) Action. Because the output power of the DGs is temporally coupled across time periods, this paper adopts the output power increment as the action variable to decouple consecutive decisions of the DGs. Therefore, the action space can be represented by:
(22)
where is the active output power increment vector of the DGs; is the terminal voltage vector of the DGs; and is the vector consisting of .
3) State transition function. In a real-world MG, state transitions occur spontaneously. However, in simulation scenarios, these transitions should be effectively characterized using the following formulations: in the next state , can be computed according to (23); and can be determined based on (13) and (18); , , , , , , , , and are known states; and the remaining states can be calculated through power flow computation in accordance with (12). In the power flow computation, we choose buses connected to DGs () as PV buses, buses connected to the utility grid as slack buses, and the remaining buses within the network framework as PQ buses. The power flow distribution within the power grid is then computed using the Newton-Raphson method as:
(23)
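The power flow step of the state transition can be illustrated with the open-source pandapower package, which applies the Newton-Raphson method by default; the three-bus example below (slack bus for the utility-grid connection, PV bus for a DG, PQ bus for a load) is an assumption for exposition, not the network or solver implementation used in the paper.

```python
# Illustrative Newton-Raphson power flow for the state transition, using
# pandapower (an assumed tool); power values and line parameters are arbitrary.
import pandapower as pp

net = pp.create_empty_network()
b_grid = pp.create_bus(net, vn_kv=10.0)   # slack bus: utility-grid connection
b_dg = pp.create_bus(net, vn_kv=10.0)     # PV bus: dispatchable generator
b_load = pp.create_bus(net, vn_kv=10.0)   # PQ bus: load
pp.create_ext_grid(net, bus=b_grid, vm_pu=1.0)           # slack bus
pp.create_gen(net, bus=b_dg, p_mw=0.5, vm_pu=1.0)        # DG modeled as a PV node
pp.create_load(net, bus=b_load, p_mw=0.8, q_mvar=0.2)    # PQ load
for f, t, km in [(b_grid, b_dg, 1.6), (b_dg, b_load, 2.8)]:
    pp.create_line_from_parameters(net, f, t, length_km=km,
                                   r_ohm_per_km=0.3, x_ohm_per_km=0.25,
                                   c_nf_per_km=10.0, max_i_ka=0.4)

pp.runpp(net)                                   # Newton-Raphson power flow
print(net.res_bus[["vm_pu", "va_degree"]])      # voltages/angles entering the next state
```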
4) Reward function. The total cost of the MDP problem is defined by (24). To maximize the satisfaction of the inequality constraints within the MINLP problem, we introduce a penalty term to penalize violations, as expressed in (25). The reward function for the MDP problem is then formulated according to the cost and overlimit penalties, as expressed in (26).
(24)
(25)
(26)
where is the variable constrained by the inequality constraint; and are the lower and upper limits of the inequality constraint, respectively; is a binary variable that equals 1 when the power flow calculation does not converge and 0 when it does; and , , and are the cost factor, constraint penalty factor, and power flow penalty factor, respectively, and is a large constant.
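A minimal sketch of the reward computation in (24)-(26) is given below, assuming the common form in which an inequality violation is penalized by its distance to the violated bound. The weight names w_c, w_p, w_f, and big_m stand for the cost factor, constraint penalty factor, power flow penalty factor, and large constant defined above; all names and defaults are illustrative.

```python
# A minimal sketch of the reward in (24)-(26); the penalty form and the weight
# values are illustrative assumptions.
def reward(step_cost, values, lowers, uppers, pf_diverged,
           w_c=1.0, w_p=10.0, w_f=1.0, big_m=1e4):
    # Penalize the amount by which each constrained variable leaves [lower, upper]
    penalty = sum(max(lo - v, 0.0) + max(v - up, 0.0)
                  for v, lo, up in zip(values, lowers, uppers))
    # Negative weighted sum of cost, constraint violation, and a large penalty
    # applied when the power flow calculation does not converge
    return -(w_c * step_cost + w_p * penalty + w_f * big_m * float(pf_diverged))
```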
Thus, the REM problem of MG is redefined as an MDP with a hybrid action space, which can be solved using regular DRL approaches. However, the following limitations may be encountered: ① inability to directly handle the hybrid action space; ② slow training speed; and ③ suboptimal feasibility of results. To overcome these limitations, the PH-PPO algorithm is applied.
This section describes the PH-PPO algorithm in detail, including an H-AC architecture, an experience-sharing-based parallel technique, and a safety projection technique that helps overcome the three aforementioned limitations.
Conventional DRL approaches can only address either a continuous or discrete action space. For the aforementioned MDP problem with a hybrid action space, a conventional DRL approach must first discretize the continuous actions, which may lead to decreased accuracy and the curse of dimensionality. For example, if all the continuous actions are discretized into levels, the action space would consist of distinct choices (corresponding to actions , , , and , respectively), where and are the numbers of DGs and ESSs, respectively. In this type of paradigm, the solution accuracy depends on the level of discrete granularity. However, an overly fine-grained discretization may lead to the curse of dimensionality, and thus hinder practical applications. To overcome these limitations, an H-AC architecture is developed as follows.
The H-AC architecture is grounded in the actor-critic architecture, which is widely employed in DRL approaches. The actor-critic architecture consists of two main components: an actor network that selects actions based on the policy, and a critic network that estimates the value function to compute the gradient of the parameters of the actor network. The H-AC architecture, which is tailored to the hybrid action space problem, differs from the traditional actor-critic architecture in that it incorporates two actor networks.

Fig. 1 H-AC architecture.
The detailed form of policy distributions and can be expressed as:
(27)
(28)
where and are the actions of the action vectors and , respectively; and are the parameters of the two actor networks, respectively; and are the distributions of and , respectively; and are the categorical and Gaussian distributions, respectively; is the category count of ; is the probability that outputs ; and are the Gaussian distribution parameters of ; , , and are the outputs of the actor network; and and are the lengths of and , respectively.
Conceptually, the H-AC architecture shares essential similarities with a fully cooperative multiagent mechanism. It employs two actor networks to handle discrete and continuous actions separately while sharing the observation space, state-encoding layer, and critic network to update the parameters of the actor networks. This enables direct adaptation to the hybrid action space and avoids the negative effects of the discretization operation.
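To make (27) and (28) concrete, the following PyTorch sketch builds a categorical head for the discrete policy, a Gaussian head for the continuous policy, and a critic head on top of a shared state encoder. The paper describes two separate actor networks sharing the state-encoding layer and critic; merging them into a single module with one categorical head and illustrative layer sizes is an assumption made here for brevity.

```python
# A minimal PyTorch sketch of the hybrid policy heads in (27)-(28); layer sizes
# and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class HybridActorCritic(nn.Module):
    def __init__(self, state_dim, n_discrete, n_continuous, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.discrete_head = nn.Linear(hidden, n_discrete)       # categorical logits
        self.mu_head = nn.Linear(hidden, n_continuous)           # Gaussian means
        self.log_std = nn.Parameter(torch.zeros(n_continuous))   # state-independent log std
        self.value_head = nn.Linear(hidden, 1)                   # critic: V(s)

    def forward(self, state):
        h = self.encoder(state)                                  # shared state encoding
        pi_d = torch.distributions.Categorical(logits=self.discrete_head(h))
        pi_c = torch.distributions.Normal(self.mu_head(h), self.log_std.exp())
        return pi_d, pi_c, self.value_head(h)

# Sampling one hybrid (discrete, continuous) action:
model = HybridActorCritic(state_dim=20, n_discrete=4, n_continuous=5)
pi_d, pi_c, value = model(torch.randn(1, 20))
hybrid_action = (pi_d.sample(), pi_c.sample())
```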
The H-AC architecture serves only as a foundational framework and requires the selection of appropriate policy optimization algorithms such as trust region policy optimization [

Fig. 2 Architecture of PH-PPO algorithm.
In PPO, the actor and critic networks have different loss functions and update methods. The parameters of the critic network are updated through the optimization of the mean-square error loss function :
(29)
(30)
(31)
where is the value of the current state estimated by the critic network; is the temporal difference (TD) target; and is the learning rate of the critic network.
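A minimal sketch of the critic update described in (29)-(31) is shown below: the TD target bootstraps from the next-state value, and the mean-square error is minimized by one gradient step. The function signature and the use of a generic optimizer are illustrative assumptions.

```python
# A minimal sketch of the critic update in (29)-(31); rewards/dones are assumed
# to be float tensors, and the optimizer is any torch optimizer.
import torch
import torch.nn.functional as F

def update_critic(critic, optimizer, states, rewards, next_states, dones, gamma=0.96):
    """One mean-square-error update of the critic toward the TD target."""
    with torch.no_grad():
        # TD target bootstrapped from the next-state value estimate
        td_target = rewards + gamma * (1.0 - dones) * critic(next_states).squeeze(-1)
    value = critic(states).squeeze(-1)          # value of the current state
    loss = F.mse_loss(value, td_target)         # mean-square error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # gradient step on the critic parameters
    return loss.item()
```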
The parameters of the actor network are updated through the optimization of the objective function :
(32)
(33)
where is the parameter of the actor network under the old policy; is the probability ratio, which serves as a metric for assessing the similarity between the new and old policies; the function constrains within and , which restricts the magnitude of updates to the new policy; is a hyperparameter that controls the degree of clipping; and is the advantage function. PPO exhibits small bias but large variance. In DRL, bias can lead to local optima, whereas high variance results in low data utilization. Therefore, this paper introduces a generalized advantage estimation (GAE) technique to estimate the advantage function and strike a balance between bias and variance [
(34)
(35)
where is an additional GAE hyperparameter; and is the TD error. The parameters of the actor network can then be updated using gradient ascent as:
(36)
where is the learning rate of the actor network. In the hybrid PPO (H-PPO) algorithm, the discrete and continuous policies each have their own loss function of the form (32). In the respective loss functions, the probability ratio considers only the discrete policy, whereas considers only the continuous policy.
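The actor update in (32)-(36) with GAE can be sketched as follows for a single actor head; in the H-PPO setting, the same clipped loss is applied separately to the discrete and continuous heads with their own probability ratios. The defaults echo the hyperparameters reported later (discount factor 0.96, GAE parameter 0.9, clipping threshold 0.2); variable names are illustrative.

```python
# A minimal sketch of GAE and the clipped PPO actor loss; trajectory inputs are
# assumed to be Python lists of floats.
import torch

def gae(rewards, values, next_values, dones, gamma=0.96, lam=0.9):
    """Generalized advantage estimation over one trajectory."""
    advantages, adv = [], 0.0
    for t in reversed(range(len(rewards))):
        # TD error at step t
        delta = rewards[t] + gamma * (1.0 - dones[t]) * next_values[t] - values[t]
        # Exponentially weighted sum of TD errors
        adv = delta + gamma * lam * (1.0 - dones[t]) * adv
        advantages.append(adv)
    return torch.tensor(list(reversed(advantages)))

def ppo_actor_loss(new_log_prob, old_log_prob, advantage, eps=0.2):
    ratio = torch.exp(new_log_prob - old_log_prob)              # probability ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()                # maximize the clipped objective
```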
In DRL approaches, offline training must sample substantial amounts of data by interacting with the MG REM simulator, which often consumes significant CPU time. To mitigate this limitation, we propose an experience-sharing-based parallel technique to develop a parallel version of the H-PPO algorithm, which we refer to as the PH-PPO algorithm.
In the PH-PPO algorithm shown in
The experience-sharing-based parallel technique allocates sampling tasks to multicore CPUs and assigns the high-density gradient computation task to the GPU, thereby realizing a rational distribution of computational resources and accelerating the training speed. It also allows multiple agents to explore different environments simultaneously and to share their individual experiences, which helps alleviate the sensitivity of the algorithm to random seeds and contributes to better training robustness.
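A minimal sketch of the experience-sharing-based parallel rollout is given below, assuming each worker holds its own copy of the environment with a distinct random seed and pushes its rollout into a shared queue for one centralized update; env_fn and collect_fn are hypothetical callables (building the MG REM environment and rolling out the current policy, respectively) and are not interfaces from the paper.

```python
# A minimal sketch of experience-sharing parallel sampling; env_fn and
# collect_fn are hypothetical placeholders supplied by the user.
import multiprocessing as mp

def worker(env_fn, collect_fn, policy_state_dict, seed, steps, queue):
    env = env_fn(seed)                                   # each worker explores its own environment
    queue.put(collect_fn(env, policy_state_dict, steps)) # share the collected experience

def parallel_rollouts(env_fn, collect_fn, policy_state_dict, seeds, steps=1024):
    queue = mp.Queue()
    procs = [mp.Process(target=worker,
                        args=(env_fn, collect_fn, policy_state_dict, s, steps, queue))
             for s in seeds]                             # one process per worker / random seed
    for p in procs:
        p.start()
    shared_batch = [queue.get() for _ in procs]          # experiences pooled from all workers
    for p in procs:
        p.join()
    return shared_batch                                  # fed to one centralized update (e.g., on the GPU)
```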
In regular DRL approaches, violations of the operational constraints in the MG are often integrated as penalty terms into the reward function within the MDP framework [
Regular policy-based DRL typically employs a Gaussian distribution as the probability distribution for continuous actions. However, the unbounded nature of the Gaussian distribution can cause actions to fall into infeasible areas during the online execution stage. To address this issue, the probability distribution corresponding to specific actions , , and is reconstructed as a bounded Beta distribution. Consequently, (28) is superseded by (37), and the outputs of the continuous actor network as shown in Figs.
(37)
where is the Beta distribution; and and are the Beta distribution parameters of .
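The bounded policy in (37) can be sketched in PyTorch as follows: the actor emits two raw outputs that are mapped to Beta parameters greater than one (a unimodality choice assumed here, not stated in the paper), and the sample on (0, 1) is affinely rescaled to the physical action bounds.

```python
# A minimal sketch of sampling a bounded continuous action from a Beta policy
# as in (37); the softplus-plus-one mapping is an illustrative choice.
import torch
import torch.nn.functional as F

def sample_beta_action(alpha_raw, beta_raw, low, high):
    """alpha_raw/beta_raw: raw actor outputs; low/high: physical action bounds."""
    alpha = 1.0 + F.softplus(alpha_raw)           # Beta parameter > 1 (unimodal)
    beta = 1.0 + F.softplus(beta_raw)             # Beta parameter > 1 (unimodal)
    dist = torch.distributions.Beta(alpha, beta)  # support bounded to (0, 1)
    x = dist.sample()
    action = low + (high - low) * x               # affine map to the feasible range
    return action, dist.log_prob(x)
```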
In regular policy-based DRL, even invalid or unsafe actions are assigned a nonzero probability. When random policies are used, these invalid or unsafe actions can potentially be sampled during the online execution stage, leading to undesirable system behaviors or even system crashes. In addition, sampling invalid or unsafe actions can impede policy training because the collected experiences related to invalid actions are meaningless and can mislead the direction of policy updates [
In this paper, the proposed action mask (AM) is presented in (38), where the “if” statement signifies the physical rule used to identify an invalid or unsafe action, and the “then” statement represents the mask that masks out that action. and are generated using (19) under the consideration that the output power of the ESSs must not cause the stored energy to exceed its upper and lower limits. - are based on (6), which takes into account that the on/off decision action and power increment action of the DGs must be coordinated. Specifically, ensures that the power increment does not cause the output power to exceed its upper and lower limits when DGs remain on; ensures the maximum upward ramping rate limit of the power increment when DGs start up; and ensure the maximum downward ramping rate limit of the power increment when DGs are turned off; and ensures that the power increment action is masked when DGs remain off. When the AM configuration is utilized, the corresponding constraints in (6) and (19) are guaranteed to be fully satisfied.
(38)
The safety projection technique restricts the output action within a feasible range, which ensures that the associated inequality constraints in the MINLP problem are fully satisfied, thereby enhancing the decision feasibility. This technique also avoids exploration in the infeasible action intervals, thereby improving exploration efficiency.
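The masking idea behind (38) and the projection of continuous actions can be sketched as follows, assuming the feasible set for each discrete choice and the increment bounds implied by the capacity and ramping limits are supplied by the environment; this is a generic illustration rather than the paper's exact rule set.

```python
# A minimal sketch of action masking and safety projection; feasibility masks
# and bounds are assumed to be provided by the environment.
import torch

def masked_categorical(logits, feasible_mask):
    """feasible_mask: boolean tensor, False marks invalid/unsafe discrete choices."""
    masked_logits = logits.masked_fill(~feasible_mask, float("-inf"))
    # Masked entries receive zero probability, so unsafe actions cannot be sampled
    return torch.distributions.Categorical(logits=masked_logits)

def project_increment(raw_increment, lower, upper):
    """Clip the continuous power-increment action into its feasible interval."""
    return torch.max(torch.min(raw_increment, upper), lower)
```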
We first introduce the parameter settings used to implement and test the proposed approach. Simulation results and comparisons with other state-of-the-art (SOTA) approaches are then presented to demonstrate the effectiveness and superiority of the proposed approach.
The training and testing are conducted using a typical 15-bus MG, as illustrated in

Fig. 3 Typical 15-bus MG.
Line | From bus | To bus | Distance (km) | Line | From bus | To bus | Distance (km) |
---|---|---|---|---|---|---|---|
L1 | 1 | 2 | 1.6 | L8 | 1 | 5 | 1.6 |
L2 | 2 | 3 | 2.8 | L9 | 5 | 7 | 1.9 |
L3 | 1 | 4 | 0.1 | L10 | 7 | 11 | 0.3 |
L4 | 4 | 6 | 3.4 | L11 | 7 | 14 | 0.9 |
L5 | 6 | 8 | 0.3 | L12 | 11 | 12 | 1.2 |
L6 | 6 | 10 | 0.8 | L13 | 12 | 13 | 0.2 |
L7 | 8 | 9 | 1.2 | L14 | 1 | 15 | 0.1 |
Time period | Price ($/kWh) | Time period | Price ($/kWh) |
---|---|---|---|
08:00-14:00 | 0.14 | 20:00-22:00 | 0.14 |
14:00-20:00 | 0.24 | 22:00-08:00 | 0.06 |
Parameter | Value | Parameter | Value
---|---|---|---
Actor learning rate | 1×1 | GAE hyperparameter | 0.9
Critic learning rate | 5×1 | Clipping threshold | 0.2
Discount factor | 0.96 | |
DG | Maximum power (kW) | Minimum power (kW) | Start-up cost ($) | Minimum on time (hour) | Minimum off time (hour) | Ramp-up rate (kW/h) | Ramp-down rate (kW/h)
---|---|---|---|---|---|---|---
MT | 900 | 50 | 26 | 1 | 1 | 900 | -900
DE | 1200 | 80 | 30 | 1 | 1 | 1200 | -1200
DG | Quadratic coefficient ($/(kW)²h) | Linear coefficient ($/kWh) | Constant coefficient ($)
---|---|---|---
MT | 3.472×10 | 0.025002 | 48
DE | 3.086×10 | 0.016680 | 56
Parameter | Value | Parameter | Value |
---|---|---|---|
(kW) | 400 | ($/kWh) | 0.049 |
(kW) | -400 | 0.9 | |
(kWh) | 1800 | 0.9 | |
(kWh) | 400 |
A series of case studies are conducted to assess the effectiveness of the proposed approach for the MG REM problem and to showcase its superiority over several SOTA approaches. The performance of the proposed approach is evaluated comprehensively, encompassing both the training and test phases.
To verify the effectiveness of the H-AC architecture, the training process of the H-PPO algorithm is compared with that of the existing PPO algorithm. Notably, if we directly apply the PPO algorithm by discretizing all continuous actions into five levels, the action space is discretized to a size of 125000, making it impossible for the PPO algorithm to explore and converge efficiently in this REM problem. Thus, to facilitate a comparison with the PPO algorithm, we choose to set the voltage of the PV buses where the DGs are located to a fixed value of 1 p.u., which simplifies the AC power flow equation, as in [

Fig. 4 Comparison of training curves of H-PPO and PPO algorithms.
These findings show that for the PPO algorithm, an overly fine discretization can result in the curse of dimensionality. By contrast, coarsening the discretization to sidestep the dimensionality curse may degrade the accuracy. Achieving a satisfactory trade-off between the two poses a significant challenge for the PPO algorithm. Unlike the PPO algorithm, the H-PPO algorithm can handle the hybrid action space directly, effectively avoiding the adverse effects of discretization.
To demonstrate the effectiveness of the experience-sharing-based parallel technique, the training process of the PH-PPO algorithm with varying numbers of workers () is investigated. Because different workers must use different random seeds to ensure the diversity of the collected samples, each experiment requires that a random seed cluster is set up. To test the robustness of the proposed approach, experiments are repeated using five random seed clusters. Notably, when the PH-PPO algorithm employs one worker, it is equivalent to the H-PPO algorithm.

Fig. 5 Training curves of PH-PPO algorithm using different numbers of workers.
Beyond the speed advantage, we find that as the number of workers increases, the shaded region of the training curve of the PH-PPO algorithm narrows. This can be explained by the ability of the experience-sharing-based parallel technique to increase sample diversity, as it integrates all samples related to each random seed within the random seed cluster to achieve a more comprehensive and unbiased evaluation. Therefore, once an outlier is sampled by a local actor dominated by a specific random seed, the samples collected by other local actors can help diminish its effect, thus effectively improving the overall training robustness.
To verify the effectiveness of the safety projection technique, a comparative study is conducted between the complete PH-PPO algorithm and a version that excludes the safety projection technique. For ease of assessment, we introduce the notion of a safe action [
We use the 30-day test dataset to calculate the safety action ratio of the two versions of the PH-PPO algorithm, as shown in
Algorithm | Safety action ratio (%) |
---|---|
With safety projection technique | 99.17 |
Without safety projection technique | 92.64 |
To verify the superiority of the proposed approach, it is compared with other SOTA real-time optimization approaches in terms of test results. The SOTA approaches include the aforementioned PPO algorithm, the myopic policy, and MPC. To simulate the effects of sampling errors under these four approaches, random numbers following a Gaussian distribution are superimposed when sampling the power of the RESs and loads in real time, where the standard deviation is set to be 1% of the actual value. In the PH-PPO algorithm, the aforementioned three techniques that have been shown to be effective are all included. In the MPC approach, forecasting data for the power of RESs and loads are generated by adding a deviation to the actual values. This deviation is sampled from a Gaussian distribution in which the standard deviation is set to be 10% of the actual value. The look-ahead time window for the MPC approach is set to be four hours. The PH-PPO algorithm is also compared with the perfect information optimum (PIO) approach [
(39)
where and are the operating costs obtained by the PIO and other approaches for a specific day, respectively.
After the training process is completed, a well-trained agent is applied to the test dataset. Using the 30-day test dataset, we calculate the daily operation costs of the REM problem of MG under various approaches, where the statistical results are presented in
Approach classification | Approach name | Mean cost ($) | Maximum cost ($) | Minimum cost ($)
---|---|---|---|---
Day-ahead benchmark | PIO | 856.90 | 1035.50 | 680.49
REM approach | Myopic | 1008.94 | 1188.43 | 832.73
REM approach | MPC | 945.69 | 1122.00 | 781.31
REM approach | PPO | 931.64 | 1111.50 | 754.71
REM approach | PH-PPO | 889.85 | 1069.98 | 712.93

Fig. 6 Violin plot of relative costs of various approaches.
Figures

Fig. 7 Power curves of WT, PV, and total load for 24 hours in a given scenario.

Fig. 8 REM details of proposed approach.
Similar to [
The PH-PPO algorithm is compared with the approaches described earlier (i.e., PPO, myopic policy, MPC, and PIO), and the test results are presented in
Approach classification | Approach name | Mean cost ($) | Maximum cost ($) | Minimum cost ($)
---|---|---|---|---
Day-ahead benchmark | PIO | 1681.46 | 1900.32 | 1522.80
REM approach | Myopic | 1972.04 | 2262.15 | 1802.30
REM approach | MPC | 1867.88 | 2084.48 | 1714.18
REM approach | PPO | | |
REM approach | PH-PPO | 1759.81 | 2005.94 | 1595.63
In this paper, a novel parallel hybrid DRL approach is proposed for the REM problem of MG. The unit commitment, AC power flow, and uncertainties are considered. The conclusions are as follows.
1) The PH-PPO algorithm adopts an H-AC architecture to handle the hybrid action space directly, which leads to faster convergence toward a superior solution as compared with regular DRL approaches.
2) The PH-PPO algorithm adopts a novel experience-sharing-based parallel technique that can fully utilize the computational resources of multicore CPUs and GPU, thus contributing to an improved convergence speed and training robustness.
3) The PH-PPO algorithm adopts a safety projection technique that can utilize prior-domain knowledge to enhance the feasibility of agent decision-making outcomes, thereby increasing the safety action ratio by 6.53%.
4) The test results confirm that the PH-PPO algorithm offers obvious advantages in terms of accuracy as compared with traditional REM approaches such as the myopic policy and MPC, while ensuring superior generalization and real-time decision-making capabilities.
In future work, more realistic and refined environmental simulators, including finer energy-storage models, higher temporal resolutions, and more realistic electricity price settings, will be considered. In addition, the PH-PPO algorithm could be extended to a multi-agent DRL framework, providing a solution to the energy management problem of multi-MG systems. Finally, investigating other SOTA DRL approaches (e.g., soft actor-critic) as policy optimization methods to further improve the performance of the PH-PPO algorithm will also be considered.
REFERENCES
Y. Zhuo, J. Zhu, J. Chen et al., “RSM-based approximate dynamic programming for stochastic energy management of power systems,” IEEE Transactions on Power Systems, vol. 38, no. 6, pp. 5392-5405, Nov. 2023.
S. Li, D. Cao, W. Hu et al., “Multi-energy management of interconnected multi-microgrid system using multi-agent deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 5, pp. 1606-1617, Sept. 2023.
V. Murty and A. Kumar, “Optimal energy management and techno-economic analysis in microgrid with hybrid renewable energy sources,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 5, pp. 929-940, Sept. 2020.
M. F. Zia, E. Elbouchikhi, and M. Benbouzid, “Microgrids energy management systems: a critical review on methods, solutions, and prospects,” Applied Energy, vol. 222, pp. 1033-1055, Jul. 2018.
W. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. Hoboken: Wiley, 2007.
K. B. Gassi and M. Baysal, “Improving real-time energy decision-making model with an actor-critic agent in modern microgrids with energy storage devices,” Energy, vol. 263, p. 126105, Jan. 2023.
H. Shuai, J. Fang, X. Ai et al., “Stochastic optimization of economic dispatch for microgrid based on approximate dynamic programming,” IEEE Transactions on Smart Grid, vol. 10, no. 3, pp. 2440-2452, May 2019.
H. Shuai, J. Fang, X. Ai et al., “Optimal real-time operation strategy for microgrid: an ADP-based stochastic nonlinear optimization approach,” IEEE Transactions on Sustainable Energy, vol. 10, no. 2, pp. 931-942, Apr. 2019.
J. Silvente, G. M. Kopanos, V. Dua et al., “A rolling horizon approach for optimal management of microgrids under stochastic uncertainty,” Chemical Engineering Research and Design, vol. 131, pp. 293-317, Mar. 2018.
Y. Zhang, F. Meng, R. Wang et al., “Uncertainty-resistant stochastic MPC approach for optimal operation of CHP microgrid,” Energy, vol. 179, pp. 1265-1278, Jul. 2019.
H. Shuai and H. He, “Online scheduling of a residential microgrid via Monte-Carlo tree search and a learned model,” IEEE Transactions on Smart Grid, vol. 12, no. 2, pp. 1073-1087, Mar. 2021.
X. Liu, T. Zhao, H. Deng et al., “Microgrid energy management with energy storage systems: a review,” CSEE Journal of Power and Energy Systems, vol. 9, no. 2, pp. 483-504, Mar. 2023.
M. L. Puterman, “Markov decision processes,” Handbooks in Operations Research and Management Science, vol. 2, pp. 331-434, Jan. 1990.
D. Liu, S. Xue, B. Zhao et al., “Adaptive dynamic programming for control: a survey and recent advances,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 1, pp. 142-160, Jan. 2021.
J. Hu, Y. Ye, Y. Tang et al., “Towards risk-aware real-time security constrained economic dispatch: a tailored deep reinforcement learning approach,” IEEE Transactions on Power Systems, vol. 39, no. 2, pp. 3972-3986, Mar. 2024.
D. Cao, W. Hu, J. Zhao et al., “Reinforcement learning and its applications in modern power and energy systems: a review,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029-1042, Nov. 2020.
H. Zhang, D. Yue, C. Dou et al., “Resilient optimal defensive strategy of TSK fuzzy-model-based microgrids system via a novel reinforcement learning approach,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 4, pp. 1921-1931, Apr. 2023.
V. François-Lavet, D. Taralla, D. Ernst et al. (2016, Nov.). Deep reinforcement learning solutions for energy microgrids management. [Online]. Available: http://orbi.ulg.ac.be/bitstream/2268/203831/1/EWRL_Francois-Lavet_et_al.pdf
Y. Ji, J. Wang, J. Xu et al., “Real-time energy management of a microgrid using deep reinforcement learning,” Energies, vol. 12, no. 12, p. 2291, Jun. 2019.
H. Shuai, F. Li, H. Pulgar-Painemal et al., “Branching dueling Q-network-based online scheduling of a microgrid with distributed energy storage systems,” IEEE Transactions on Smart Grid, vol. 12, no. 6, pp. 5479-5482, Nov. 2021.
Y. Qi, X. Xu, Y. Liu et al., “Intelligent energy management for an on-grid hydrogen refueling station based on dueling double deep Q network algorithm with NoisyNet,” Renewable Energy, vol. 222, p. 119885, Feb. 2024.
P. Chen, M. Liu, C. Chen et al., “A battery management strategy in microgrid for personalized customer requirements,” Energy, vol. 189, p. 116245, Dec. 2019.
L. Lei, Y. Tan, G. Dahlenburg et al., “Dynamic energy dispatch based on deep reinforcement learning in IoT-driven smart isolated microgrids,” IEEE Internet of Things Journal, vol. 8, no. 10, pp. 7938-7953, Dec. 2020.
C. Guo, X. Wang, Y. Zheng et al., “Real-time optimal energy management of microgrid with uncertainties based on deep reinforcement learning,” Energy, vol. 238, p. 121873, Jan. 2022.
T. Nakabi and P. Toivanen, “Deep reinforcement learning for energy management in a microgrid with flexible demand,” Sustainable Energy, Grids and Networks, vol. 25, p. 100413, Mar. 2021.
H. Li, Z. Wan, and H. He, “Real-time residential demand response,” IEEE Transactions on Smart Grid, vol. 11, no. 5, pp. 4144-4154, Sept. 2020.
Y. Chen, J. Zhu, Y. Liu et al., “Distributed hierarchical deep reinforcement learning for large-scale grid emergency control,” IEEE Transactions on Power Systems, vol. 39, no. 2, pp. 4446-4458, Mar. 2024.
H. Li, Z. Wang, L. Li et al., “Online microgrid energy management based on safe deep reinforcement learning,” in Proceedings of 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, USA, Dec. 2021, pp. 1-8.
T. Lu, R. Hao, Q. Ai et al., “Distributed online dispatch for microgrids using hierarchical reinforcement learning embedded with operation knowledge,” IEEE Transactions on Power Systems, vol. 38, no. 4, pp. 2989-3002, Jul. 2023.
N. Heess, T. B. Dhruva, S. Sriram et al. (2017, Jul.). Emergence of locomotion behaviours in rich environments. [Online]. Available: http://arxiv.org/abs/1707.02286
J. Schulman, F. Wolski, P. Dhariwal et al. (2017, Aug.). Proximal policy optimization algorithms. [Online]. Available: http://arxiv.org/abs/1707.06347
D. Chen, M. R. Hajidavalloo, Z. Li et al., “Deep multi-agent reinforcement learning for highway on-ramp merging in mixed traffic,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 11, pp. 11623-11638, Nov. 2023.
J. Zhu, Y. Zhuo, J. Chen et al., “An expected-cost realization-probability optimization approach for the dynamic energy management of microgrid,” International Journal of Electrical Power & Energy Systems, vol. 136, p. 107620, Mar. 2022.
RTE. (2024, Jan.). éCO2mix. [Online]. Available: https://www.rte-france.com/eco2mix
P. Tian, X. Xiao, K. Wang et al., “A hierarchical energy management system based on hierarchical optimization for microgrid community economic operation,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2230-2241, Sept. 2016.
X. Xue, X. Ai, J. Fang et al., “Real-time schedule of microgrid for maximizing battery energy storage utilization,” IEEE Transactions on Sustainable Energy, vol. 13, no. 3, pp. 1356-1369, Jul. 2022.
M. Alshiekh, R. Bloem, R. Ehlers et al. (2018, Apr.). Safe reinforcement learning via shielding. [Online]. Available: https://arxiv.org/abs/1708.08611
S. Gao, C. Xiang, M. Yu et al., “Online optimal power scheduling of a microgrid via imitation learning,” IEEE Transactions on Smart Grid, vol. 13, no. 2, pp. 861-876, Mar. 2022.
N. Zografou-Barredo, C. Patsios, I. Sarantakos et al., “Microgrid resilience-oriented scheduling: a robust MISOCP model,” IEEE Transactions on Smart Grid, vol. 12, no. 3, pp. 1867-1879, May 2021.
A. Gholami, T. Shekari, F. Aminifar et al., “Microgrid scheduling with uncertainty: the quest for resilience,” IEEE Transactions on Smart Grid, vol. 7, no. 6, pp. 2849-2858, Nov. 2016.
S. Zeinal-Kheiri, A. M. Shotorbani, and B. Mohammadi-Ivatloo, “Real-time energy management of grid-connected microgrid with flexible and delay-tolerant loads,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1196-1207, Nov. 2020.
M. Yin, K. Li, and J. Yu, “A data-driven approach for microgrid distributed generation planning under uncertainties,” Applied Energy, vol. 309, p. 118429, Jan. 2022.