Abstract
Modern power systems are experiencing larger fluctuations and more uncertainties caused by the increased penetration of renewable energy sources (RESs) and power electronics equipment. Therefore, fast and accurate corrective control actions in real time are needed to ensure system security and economic operation. This paper presents a novel method to derive real-time alternating current (AC) optimal power flow (OPF) solutions considering uncertainties, including varying renewable generation and topology changes, using a state-of-the-art deep reinforcement learning (DRL) algorithm, which can effectively assist grid operators in making rapid and effective real-time decisions. The presented DRL-based approach first adopts a supervised-learning method from deep learning to generate good initial weights for the neural networks, and then the proximal policy optimization (PPO) algorithm is applied to train and test the artificial intelligence (AI) agents for stable and robust performance. An ancillary classifier is designed to identify the feasibility of the AC OPF problem. Case studies conducted on the Illinois 200-bus system with wind generation variation and topology changes validate the effectiveness of the proposed method and demonstrate its great potential in promoting sustainable energy integration into the power system.
ALTERNATING current (AC) optimal power flow (OPF) remains an essential but challenging optimization problem for the operation and control of modern power systems with high penetration of renewable energy sources (RESs). Many approaches have been proposed in recent decades to solve this non-convex, NP-hard problem, whose solution is typically too time-consuming to converge for real-time application [
To address this issue, [
Inspired by the efforts above, this paper presents a novel DRL-based approach, the contributions of which are summarized below.
1) It adopts the proximal policy optimization (PPO) algorithm introduced in [13] to train the AI agent for solving the AC OPF problem.
2) To improve the agent’s learning speed and performance during the offline training process, a supervised-learning regression method is applied to initialize the weights of the DRL agent, serving as an “initial guide”.
3) A reward function is carefully designed to tackle the feasibility issue, with which the DRL agent learns an optimal stochastic policy. Therefore, compared with repeatedly solving many stochastic scenarios arising from the uncertainties under high RES penetration and topology changes, the proposed method is well suited for real-time security-constrained economic dispatch applications.
Numerical experiments are conducted on the Illinois 200-bus system with RESs and realistic operational data extracted from CAISO [14] to validate the effectiveness of the proposed approach.
The remainder of this paper is organized as follows. Section II provides the problem formulation and the preliminaries of DRL algorithms. In Section III, the detailed procedures of the proposed methodology are illustrated. In Section IV, numerical experiments are conducted on the Illinois 200-bus system to demonstrate the performance of DRL agents and the effectiveness of the proposed method. Finally, Section V draws the conclusion and presents future work.
Considering an AC system with a set of buses $\mathcal{N}$ with a total of $n_b$ buses, a set of transmission lines $\mathcal{L}$ with a total of $n_{br}$ branches, and a set of generator buses $\mathcal{G}$ with a total of $n_G$ generators, the AC OPF problem can be formulated as:
$$
\begin{aligned}
\min_{P_g,\,V_g}\quad & \sum_{k \in \mathcal{G}} C_k\!\left(P_{g_k}\right) \\
\text{s.t.}\quad & P_{g_k}^{\min} \le P_{g_k} \le P_{g_k}^{\max}, \quad \forall k \in \mathcal{G} \\
& Q_{g_k}^{\min} \le Q_{g_k} \le Q_{g_k}^{\max}, \quad \forall k \in \mathcal{G} \\
& V_k^{\min} \le \left|V_k\right| \le V_k^{\max}, \quad \forall k \in \mathcal{N} \\
& \left|S_{lm}\right| \le S_{lm}^{\max}, \quad \forall (l,m) \in \mathcal{L} \\
& P_{g_k} - P_{d_k} = \operatorname{Re}\Big\{ V_k \sum_{l \in \mathcal{N}} y_{kl}^{*} V_l^{*} \Big\}, \quad \forall k \in \mathcal{N} \\
& Q_{g_k} - Q_{d_k} = \operatorname{Im}\Big\{ V_k \sum_{l \in \mathcal{N}} y_{kl}^{*} V_l^{*} \Big\}, \quad \forall k \in \mathcal{N}
\end{aligned} \tag{1}
$$
where $y_{kl}$ is the admittance between buses $k$ and $l$; the subscripts $g$ and $d$ represent the generator and the load, respectively; $P$ and $Q$ are the active power and reactive power, respectively; $C_k(\cdot)$ is the generation cost function of generator $k$; $V_k$ is the voltage at bus $k$; and $S_{lm}$ is the branch flow between buses $l$ and $m$, limited by $S_{lm}^{\max}$. In the model above, the wind farm is also treated as a PV (constant power and constant voltage) bus, and thus the corresponding operational limit in the first constraint of (1) becomes $0 \le P_{g\_wind_k} \le P_{wind_k}$, where $P_{g\_wind_k}$ is the active power output of the wind farm connected to bus $k$ and $P_{wind_k}$ is its available wind power.
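As an illustration of how (1) is solved numerically, the sketch below calls the interior-point AC OPF solver in PYPOWER [26], which this paper later uses to generate labels. The bundled IEEE 30-bus case is only a stand-in for the Illinois 200-bus system, the load perturbation mirrors the data-generation step of Section IV, and the random seed and option values are illustrative assumptions.

```python
# Sketch: solve one perturbed operating point with PYPOWER's interior-point AC OPF.
# The IEEE 30-bus case is a stand-in for the Illinois 200-bus system (assumption).
import numpy as np
from pypower.api import case30, ppoption, runopf
from pypower.idx_bus import PD, QD
from pypower.idx_gen import PG, VG

ppc = case30()                                  # stand-in test case
ppopt = ppoption(VERBOSE=0, OUT_ALL=0)          # silence solver printout

# Perturb every load uniformly within [0.6, 1.4] of the base case, as in Section IV.
rng = np.random.default_rng(0)
scale = rng.uniform(0.6, 1.4, size=ppc["bus"].shape[0])
ppc["bus"][:, PD] *= scale
ppc["bus"][:, QD] *= scale

res = runopf(ppc, ppopt)                        # interior-point AC OPF solution
if res["success"]:
    pg_opt = res["gen"][:, PG]                  # optimal active power set-points (labels)
    vg_opt = res["gen"][:, VG]                  # optimal voltage set-points (labels)
    print("total generation cost ($/h):", res["f"])
```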
The motivation of applying DL to solve the AC OPF is to find a mapping function, represented by a DNN parameterized by $\theta$, between the operating states and the optimal generator settings such that the solving speed can be improved significantly. Unlike [
$$
\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \left\| \hat{f}_{\theta}\!\left(s_i\right) - y_i^{*} \right\|_2^2 \tag{2}
$$
However, if the DNN is trained only with (2), the feasibility of the AC OPF problem cannot be guaranteed after running the power flow (PF) solver during online implementation: even when the loss is small, the operational security limits defined in (1) may be violated. Although [
The goal of DRL is to train an agent to learn an optimal policy that maximizes the expected return by continuously interacting with the environment [
Categorized as an actor-critic type of RL algorithm, the PPO agent consists of two DNNs: the first DNN, the “actor”, is trained to learn the stochastic optimal policy, and the second DNN, the “critic”, is designed to estimate the value function. The PPO algorithm ensures improved performance compared with other policy gradient algorithms owing to the following two enhancements to the “actor” updates. First, the generalized advantage estimation (GAE) function is utilized during the “actor” training process to reduce the variance of the estimation, as shown in (3) [18].
$$
A_t = \sum_{l=0}^{\infty} (\gamma \lambda)^{l}\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V\!\left(s_{t+1}\right) - V\!\left(s_t\right) \tag{3}
$$
where $V(s_t)$ is the state value representing how good a state is, calculated as the expected reward starting from state $s_t$ at time step $t$ following a certain policy, and is the output of the “critic” network; $\lambda$ controls the average degree of the $n$-step advantage values; $r_t$ is the immediate reward from the environment at time step $t$; and $\gamma$ is the discount factor on the future reward.
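For clarity, a minimal NumPy sketch of the backward recursion that evaluates (3) over one collected episode is given below; the reward and value arrays, the bootstrap value of the final state, and the $\gamma$ and $\lambda$ settings are illustrative assumptions.

```python
# Sketch: generalized advantage estimation (GAE) of (3) for one collected episode.
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l}, delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    values = np.append(values, last_value)      # bootstrap value of the final state
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(rewards))):     # backward recursion of the GAE sum
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Example with made-up rewards and critic values (illustrative numbers only).
adv = gae_advantages(np.array([1.0, 0.5, 2.0]), np.array([0.8, 0.7, 1.5]), last_value=0.0)
```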
Second, the PPO algorithm updates the “actor” parameters within an appropriate trust region, which helps avoid falling off the “cliffs” of the reward hyper-surfaces that may be hard to escape from. Such a safe update is achieved by modifying the objective function as:
$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( \frac{\pi_\theta\!\left(a_t \mid s_t\right)}{\pi_{\theta_{old}}\!\left(a_t \mid s_t\right)} A_t,\ \operatorname{clip}\!\left( \frac{\pi_\theta\!\left(a_t \mid s_t\right)}{\pi_{\theta_{old}}\!\left(a_t \mid s_t\right)},\, 1-\epsilon,\, 1+\epsilon \right) A_t \right) \right] \tag{4}
$$
where $\theta$ indicates the parameters of the “actor” DNN $\pi_\theta(a_t \mid s_t)$; $\epsilon$ determines the range of the trust region for the update; and the advantage value $A_t$ is calculated from (3). The minimization operator ensures that the new policy does not benefit from moving too far away from the old policy, and thus regulates the update of the DNN parameters.
Besides, the policy in PPO is stochastic and is parameterized as a conditional Gaussian policy $\pi_\theta(a_t \mid s_t) = \mathcal{N}\!\left(\mu_\theta(s_t), \Sigma\right)$. The mean value $\mu_\theta(s_t)$ is the output of the “actor” DNN, and the covariance $\Sigma$ is initially assigned manually but is updated during backpropagation. In (4), $\theta_{old}$ indicates the policy parameters before updating the “actor”.
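The clipped surrogate in (4) for such a conditional Gaussian policy can be sketched in PyTorch as follows (PyTorch is used here only for illustration; this paper does not specify the framework). The shared log-standard-deviation vector, the clip range $\epsilon = 0.2$, and the tensor shapes are illustrative assumptions.

```python
# Sketch: clipped surrogate objective of (4) for a conditional Gaussian policy.
import torch
from torch.distributions import Normal

def ppo_actor_loss(mean_new, mean_old, log_std, actions, advantages, eps=0.2):
    # The same log-std for the new and old policies is assumed here for simplicity.
    logp_new = Normal(mean_new, log_std.exp()).log_prob(actions).sum(-1)
    logp_old = Normal(mean_old, log_std.exp()).log_prob(actions).sum(-1).detach()
    ratio = torch.exp(logp_new - logp_old)      # r_t(theta) = pi_theta / pi_theta_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(surr1, surr2).mean()      # maximize (4) -> minimize its negative
```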
As for the “critic” DNN parameterized by $\phi$, which is designed to estimate the value function $V_\phi(s_t)$, the objective function used to update the “critic” is shown in (5).
$$
L(\phi) = \frac{1}{N_{batch}} \sum_{s_t \in D_{batch}} \left( V_\phi\!\left(s_t\right) - R_t \right)^2 \tag{5}
$$
$$
R_t = \sum_{l=0}^{e_l - t} \gamma^{\,l}\, r_{t+l} \tag{6}
$$
where $R_t$ is the discounted accumulated reward; $D_{batch}$ is the set of trajectories collected from the agent interacting with the environment, with batch size $N_{batch}$; and $e_l$ is the episode length when the agent interacts with its environment.
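A minimal PyTorch sketch of the “critic” update in (5) and the discounted return in (6) is given below; the value network and the $\gamma$ setting are illustrative assumptions.

```python
# Sketch: discounted return (6) and the "critic" regression loss (5).
import torch

def discounted_returns(rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):                 # R_t = r_t + gamma * R_{t+1}
        running = r + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def critic_loss(value_net, states, returns):
    values = value_net(states).squeeze(-1)      # V_phi(s_t) predicted by the "critic"
    return torch.mean((values - returns) ** 2)  # mean squared error over the batch
```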
The proposed DRL-based framework for AC OPF solutions is illustrated in Fig. 1.

Fig. 1 DRL-based framework to solve AC OPF problem.
The state, which is the input of the PPO agent, includes the active and reactive power ($P_{d_i}$ and $Q_{d_i}$) of the system loads at all buses ($i = 1, 2, \ldots, n_b$), the magnitude and angle of the diagonal elements of the admittance matrix $Y$, and all $n_G$ generators’ initial active power settings $P_{g_j}$ and voltage settings $V_{g_j}$ ($j = 1, 2, \ldots, n_G$), as denoted in (7). The MinMax scaling preprocessing technique [20] is applied to normalize the state variables.
$$
s_t = \left[ P_{d_1}, \ldots, P_{d_{n_b}},\ Q_{d_1}, \ldots, Q_{d_{n_b}},\ \left|Y_{11}\right|, \ldots, \left|Y_{n_b n_b}\right|,\ \angle Y_{11}, \ldots, \angle Y_{n_b n_b},\ P_{g_1}, \ldots, P_{g_{n_G}},\ V_{g_1}, \ldots, V_{g_{n_G}} \right] \tag{7}
$$
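The assembly of the state vector (7) from a power flow case, followed by MinMax scaling, can be sketched as follows. The bundled IEEE 30-bus PYPOWER case stands in for the Illinois 200-bus system, and fitting the scaler on a single sample is only for illustration; in practice it would be fitted on the training dataset.

```python
# Sketch: build the state vector of (7) from a PYPOWER case and apply MinMax scaling.
import numpy as np
from pypower.api import case30
from pypower.ext2int import ext2int
from pypower.makeYbus import makeYbus
from pypower.idx_bus import PD, QD
from pypower.idx_gen import PG, VG
from sklearn.preprocessing import MinMaxScaler

ppc = ext2int(case30())                                   # internal (consecutive) indexing
Ybus, _, _ = makeYbus(ppc["baseMVA"], ppc["bus"], ppc["branch"])
y_diag = Ybus.diagonal()                                  # carries topology information

state = np.concatenate([
    ppc["bus"][:, PD], ppc["bus"][:, QD],                 # load P and Q at all buses
    np.abs(y_diag), np.angle(y_diag),                     # |Y_kk| and angle of Y_kk
    ppc["gen"][:, PG], ppc["gen"][:, VG],                 # initial generator set-points
])

scaler = MinMaxScaler()                                   # in practice: fit on the training data
state_scaled = scaler.fit_transform(state.reshape(-1, 1)).ravel()
```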
The action space consists of the incremental adjustments to the generator set-points shown in (8), rather than the optimal set-points themselves, because of the training interactions between the DRL agent and its environment. The well-trained DRL agent can therefore adaptively reach the optimal status within several adjustment steps during the online testing process, although the training target is to achieve optimality in one step.
$$
a_t = \left[ \Delta P_{g_1}, \ldots, \Delta P_{g_{n_G}},\ \Delta V_{g_1}, \ldots, \Delta V_{g_{n_G}} \right] \tag{8}
$$
The DNN structures for the “actor” and the “critic” in PPO are shown in Figs. 2 and 3, respectively.

Fig. 2 DNN structure for “actor” in PPO training.

Fig. 3 DNN structure for “critic” in PPO training.
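A hedged PyTorch sketch of an “actor” mean network consistent with the structure of Fig. 2, i.e., one 1-D convolutional layer followed by fully-connected layers as mentioned in Section IV, is given below. The kernel size, channel count, hidden sizes, and the bounded output activation are illustrative assumptions rather than the exact settings of this paper.

```python
# Sketch: "actor" mean network with one 1-D convolutional layer and fully-connected layers.
import torch
import torch.nn as nn

class ActorMean(nn.Module):
    def __init__(self, state_dim=878, action_dim=78):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=5, padding=2)
        self.fc = nn.Sequential(
            nn.Linear(8 * state_dim, 512), nn.ReLU(),
            nn.Linear(512, action_dim), nn.Tanh(),        # bounded incremental adjustments
        )

    def forward(self, x):                                 # x: (batch, state_dim)
        h = torch.relu(self.conv(x.unsqueeze(1)))         # -> (batch, 8, state_dim)
        return self.fc(h.flatten(1))                      # mean of the Gaussian policy
```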
Because system topology information is included, the input state dimension is roughly twice as large as those in [
To facilitate the DRL training process for solving the AC OPF problem with large state and action spaces, the agent should start training from a good initial status; this alleviates the sample inefficiency caused by numerous trial-and-error interactions without expert demonstration, so the DRL training process can be sped up and become more effective. On the other hand, the DL training result also serves as a validation of the DNN structure used in the “actor”. The difference here is that the “labels” in the initialization process become the optimal generator setting adjustments shown in (8) for further DRL training. After collecting the optimal action labels $a_i^{*}$ and the corresponding states $s_i$ into the training dataset $D_{train}$ of size $N_{DL}$ by running the AC OPF solver offline, adopting (9) as the loss function, and applying a first-order optimizer such as stochastic gradient descent, the initial mean value of the stochastic policy in the PPO agent can be trained to clone the optimal generator settings from the AC OPF solution results.
$$
\mathcal{L}_{DL}(\theta) = \frac{1}{N_{DL}} \sum_{i \in D_{train}} \left\| \mu_\theta\!\left(s_i\right) - a_i^{*} \right\|_2^2 \tag{9}
$$
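A minimal PyTorch sketch of the supervised initialization in (9) is given below: the mean network of the stochastic policy (e.g., the conv-based actor sketched above) is trained to clone the optimal set-point adjustments obtained offline from the AC OPF solver. Full-batch training, the Adam optimizer, and the learning rate are illustrative choices.

```python
# Sketch: supervised initialization of the policy mean using the loss in (9).
import torch
import torch.nn as nn

def init_actor_mean(actor_mean, states, opt_adjustments, epochs=50, lr=1e-4):
    """states: (N_DL, 878) tensor; opt_adjustments: (N_DL, 78) tensor of labels a*."""
    optimizer = torch.optim.Adam(actor_mean.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                                # implements the average in (9)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(actor_mean(states), opt_adjustments)
        loss.backward()
        optimizer.step()
    return actor_mean
```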

Fig. 4 Flowchart of power system environment interacting with an agent.
The detailed design of the reward function is given in (10).
(10)
where $R_{pg\_v}$, $R_{v\_v}$, and $R_{br\_v}$, shown in (11), correspond to negative rewards assigned if violations of any inequality constraints are detected, including: ① the active power limits of generators; ② the voltage magnitude limits of buses; and ③ the thermal flow limits (in both directions) of transmission lines. The variable $Costs_{gen}$ in (10) is the total generation cost of the power system.
(11)
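Since the exact expressions of (10) and (11) are not reproduced here, the following sketch only illustrates the described logic: fixed negative penalties are returned whenever any inequality limit in (1) is violated, and a cost-based positive reward is returned otherwise. The penalty magnitude and the cost scaling are illustrative assumptions, not the values used in this paper, and the inputs are assumed to be NumPy arrays taken from a solved power flow.

```python
# Sketch: reward logic described for (10)-(11); penalty and scaling values are assumptions.
def reward(pg, vm, s_from, s_to, cost_gen, limits, penalty=-50.0, cost_scale=1e4):
    r_pg_v = penalty if ((pg < limits["pg_min"]).any() or (pg > limits["pg_max"]).any()) else 0.0
    r_v_v = penalty if ((vm < limits["v_min"]).any() or (vm > limits["v_max"]).any()) else 0.0
    r_br_v = penalty if ((s_from > limits["s_max"]).any() or (s_to > limits["s_max"]).any()) else 0.0
    if r_pg_v or r_v_v or r_br_v:
        return r_pg_v + r_v_v + r_br_v                    # infeasible operating point
    return cost_scale / cost_gen                          # feasible: cheaper dispatch, larger reward
```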
With the DL-based initialization, the DRL training can produce more reliable and improved results. A brief illustration of PPO training is shown in Fig. 5.

Fig. 5 A brief illustration of PPO training in this paper.
In
As for the computational time analysis, the process of the proposed approach for solving the AC OPF problem during the online implementation consists of two parts: the feed-forward calculation time of the well-trained “actor” DNN shown in Fig. 2, and the time for running the PF solver to evaluate the adjusted operating point.
Because of the improved solving time for obtaining near-optimal solutions, the well-trained agent can evaluate more stochastic scenarios resulting from RES uncertainties and topology changes than the conventional interior-point solver. Since the well-trained agent learns the stochastic optimal policy of feasible AC OPF solutions, it can be applied online.
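The online application can be sketched as the following loop: the trained “actor” proposes incremental adjustments (8), the environment of Fig. 4 runs an AC power flow on the updated set-points, and the loop stops once no limit in (1) is violated or a step budget is exhausted. Here, build_state, apply_adjustments, run_power_flow, and check_violations are hypothetical helper functions, and the fitted MinMax scaler from training is assumed to be available.

```python
# Sketch: online application loop of the trained "actor"; the helper functions are
# hypothetical placeholders for the environment of Fig. 4.
import torch

@torch.no_grad()
def online_opf(actor_mean, ppc, scaler, max_steps=10):
    pf_result = run_power_flow(ppc)                                   # initial operating point
    for _ in range(max_steps):
        state = scaler.transform(build_state(ppc).reshape(1, -1))     # (7), MinMax-scaled
        delta = actor_mean(torch.as_tensor(state, dtype=torch.float32)).numpy().ravel()
        apply_adjustments(ppc, delta)                                 # update Pg, Vg by (8)
        pf_result = run_power_flow(ppc)                               # AC power flow check
        if not check_violations(pf_result):                           # feasible -> stop early
            break
    return ppc, pf_result
```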
The proposed approach to solving the AC OPF problem considering wind integration and topology-changing scenarios is tested on the Illinois 200-bus system (with 200 buses, 38 original generators, 1 wind farm connected to bus 161, and 245 lines) [25]. The training and testing datasets are generated as follows.
1) Data generation: each load is randomly perturbed within [0.6, 1.4] p.u. of the base case (the original data file) under a uniform distribution; each generator’s set point, including the wind farm’s output, is also randomly perturbed within [Pgmin, Pgmax] for active power control and [Vgmin, Vgmax] for reactive power control; and a transmission line is randomly chosen to be tripped under a uniform distribution to simulate the topology-changing scenarios (only the data with feasible solutions from the interior-point solver (IPS) are included).
2) Label creation: the IPS is adopted to generate the optimal action labels for the “actor” initialization, and to indicate whether the AC OPF problem is feasible or not.
3) Data arrangement: all the data with feasible AC OPF solutions are collected and divided into three datasets: 130000 samples covering both the original system topology and transmission-line-tripping conditions form the training dataset used for “actor” initialization and PPO training; 23489 samples with the original system topology form testing dataset I, and 11511 samples with topology changes form testing dataset II (35000 testing samples in total), used for testing the trained agent online and verifying its performance. Besides, to further validate the well-trained agent’s performance under realistic operating scenarios with uncertainties, the real-time load and wind power data per 5 min from CAISO in August 2019 [14] are also adopted.
The cost comparison in percentage $\eta_{cost}$, the feasibility rate, and the total computation time are chosen as the performance evaluation indices during the online testing process. The cost comparison in percentage $\eta_{cost}$, which describes the optimality of the agent’s solution relative to the IPS solution, is shown in (12) [
$$
\eta_{cost} = \frac{Cost_{agent} - Cost_{IPS}}{Cost_{IPS}} \times 100\% \tag{12}
$$
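The two accuracy-related indices can be computed as sketched below, assuming the generation costs from the agent and from the IPS, together with per-sample violation flags, have been collected as arrays.

```python
# Sketch: cost comparison in percentage (12) and feasibility rate over a testing dataset.
import numpy as np

def cost_gap_percent(cost_agent, cost_ips):
    return (np.asarray(cost_agent) - np.asarray(cost_ips)) / np.asarray(cost_ips) * 100.0

def feasibility_rate(violation_flags):
    flags = np.asarray(violation_flags, dtype=bool)       # True if any limit is violated
    return 100.0 * (1.0 - flags.mean())                   # percentage of feasible solutions
```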
In this paper, a rated 150 MW wind farm is connected to bus 161, and all bus voltage magnitude limits are modified from [0.9, 1.1] p.u. to [0.95, 1.05] p.u. Accordingly, the dimensions of the state and action spaces are 878 and 78, respectively. The maximum episode length T is set to 100. One convolutional layer with [
The initialization results of applying (9) to train the “actor” with the convolutional layer are shown in Fig. 6.

Fig. 6 DL initialization results. (a) Training loss curve. (b) Relative error for Pg. (c) Relative error for Vg.

Fig. 7 PPO training process where reward is rescaled 1000 times smaller.
The well-trained PPO agent is then adopted to perform the online AC OPF task on the testing dataset and the corresponding results are shown in
Due to the smaller feasible region under topology-change scenarios, all violation data in testing dataset II trigger the bus voltage magnitude violation flag. The on-policy characteristic of PPO may explain why the agent cannot solve this very small portion of violation data.
Furthermore, the running time comparison is made on a desktop equipped with an Intel i7-7700 CPU and 8 GB RAM. To obtain near-optimal solutions for the 23489 samples in testing dataset I, the IPS (with the initial-point vector set to the mean values of the decision variables’ lower and upper bounds, the default in PYPOWER [26]) takes 6.1 hours, while the proposed method takes only 0.41 hours, indicating an average speedup of approximately 14 times. It could be even faster if a GPU were used.
To verify the effectiveness in securing post-contingency states under topology changes, another testing dataset with 1000 samples is generated for the 21 selected contingency scenarios.

Fig. 8 Online testing results of initialized PPO agent under selective post-contingencies.
As shown in
To further show the effectiveness and robustness of the proposed approach, the real-time data with 5-min intervals from CAISO in August 2019 [14] are applied as the new online testing data, as shown in Fig. 9.

Fig. 9 Real-time load and wind power profiles per 5 min from CAISO in August 2019.

Fig. 10 Online testing results of initialized PPO agents for real-time data from CAISO in August 2019 under original system topology.
The positive rewards in Fig. 10 indicate that the well-trained agent provides feasible solutions without violating the operational limits in (1) for these real-time operating conditions.

Fig. 11 Online testing results of initialized PPO agent for real-time data from CAISO in August 2019 under topology change conditions (one random transmission line is tripped).
To deal with the randomness and uncertainty brought in by high-penetration RESs, it is envisioned that faster control and decision-making will be needed in operating the power systems of the future. Therefore, in this paper, we randomly pick the real-time data from CAISO on August 2, 2019, interpolate them to 6 s intervals, and add noises to emulate faster fluctuations, as shown in Fig. 12.

Fig. 12 Real-time load and wind power profiles from CAISO with interpolation per 6 s and added noises on August 2, 2019.

Fig. 13 Online testing results of initialized PPO agent for real-time data from CAISO with interpolation per 6 s and added noises on August 2, 2019 under original system topology.

Fig. 14 Online testing results of initialized PPO agent for real-time data from CAISO with interpolation per 6 s and added noises on August 2, 2019 under topology change conditions (one random transmission line is tripped).
Moreover, the selected post-contingencies are also tested with the real-time data from CAISO on August 2, 2019, and the corresponding results are shown in Fig. 15.

Fig. 15 Online testing results of initialized PPO agent for real-time data from CAISO on August 2, 2019 under selective post-contingencies.
From
An additional ancillary and independent “alarm” function can be designed to help system operators identify whether the current load, RES power outputs, and system topology information would lead to infeasibility from IPS. This is formulated as a classification problem (the label is 0 if infeasible, otherwise 1). By running the IPS to generate data under various conditions including topology changes (one random transmission line is tripped), 140000 data samples are adopted as a training dataset.
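A minimal PyTorch sketch of such a feasibility classifier is given below: it takes the same kind of state information as (7) and outputs a logit for the binary feasibility label. The layer sizes, the optimizer, and the loss choice (binary cross-entropy with logits) are illustrative assumptions.

```python
# Sketch: binary feasibility classifier (1 = feasible, 0 = infeasible according to the IPS).
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(878, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1),                                     # logit for the feasibility label
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

def train_step(states, labels):
    """states: (batch, 878) tensor; labels: (batch,) tensor of 0/1 feasibility flags."""
    optimizer.zero_grad()
    loss = loss_fn(classifier(states).squeeze(-1), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```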

Fig. 16 Training process for feasibility classification of AC OPF problem. (a) DNN structure. (b) Accuracy of training process.
We choose three representative operating conditions from the real-time CAISO profile on August 2, 2019 shown in Fig. 12:
1) Peak time at 18:02, where the load is around 1.19 p.u. and the wind power output is around 0.63 p.u.
2) Off-peak time at 03:11, where the load is around 0.66 p.u. and the wind power output is around 0.52 p.u.
3) Time at 12:38, where the wind power output is at a relatively low level of 0.045 p.u. and the load is around 0.85 p.u.
The corresponding results are shown in Fig. 17.

Fig. 17 Cost comparison for all contingencies on topology changes for peak time, off-peak time, and low wind power output time on August 2, 2019 from CAISO profile. (a) Peak time. (b) Off-peak time. (c) Low wind power output time.
This paper proposes a novel framework for deriving fast AC OPF solutions for real-time applications using deep reinforcement learning. Case studies are based on the Illinois 200-bus system, and real-time data from CAISO are also adopted. The testing results demonstrate that, after offline DRL training, near-optimal AC OPF solutions can be obtained with at least a 14-time speedup compared with the interior-point method. Moreover, the well-trained DRL agent robustly reaches near-optimal operating status under the uncertainties of RESs and topology changes, which shows great potential for the operation and control of modern power systems with high penetration of renewable energy. Although only the outage of one random transmission line is included as the uncertainty regarding topology-change scenarios, the method could be expanded to outages of multiple transmission lines at a higher computational burden. Furthermore, an efficient and robust classifier, which serves as an independent “alarm” function, is designed to help system operators identify the feasibility of the AC OPF problem under the present conditions of loading, RES outputs, and topology.
Future work includes further improvements of the AI agent to gain higher accuracy, application of GPUs for parallelization and better speedup, and tests of the proposed approach on larger power systems. Besides, the ramping rate limits of the generators will be considered for solving the multi-period AC OPF problem. In addition, since the solver results are used to initialize the agent and the IPS in this paper only considers pre-contingency states, further investigation is required to guarantee security when full contingencies are considered in solving the AC OPF problem.
References
[1] Y. Tang, K. Dvijotham, and S. Low, “Real time optimal power flow,” IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2963-2973, Nov. 2017.
[2] E. Dall’Anese and A. Simonetto, “Optimal power flow pursuit,” IEEE Transactions on Smart Grid, vol. 9, no. 2, pp. 942-959, Mar. 2018.
[3] X. Pan, T. Zhao, and M. Chen, “DeepOPF: deep neural network for DC optimal power flow,” in Proceedings of 2019 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids, Beijing, China, Oct. 2019, pp. 1-12.
[4] D. Deka and S. Misra, “Learning for DC-OPF: classifying active sets using neural nets,” in Proceedings of 2019 IEEE Milan PowerTech, Milan, Italy, Jun. 2019, pp. 1-6.
[5] A. Venzke, G. Qu, S. Low et al., “Learning optimal power flow: worst-case guarantees for neural networks,” in Proceedings of 2020 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids, Tempe, USA, Nov. 2020, pp. 1-7.
[6] A. S. Zamzam and K. Baker, “Learning optimal solutions for extremely fast AC optimal power flow,” in Proceedings of 2020 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids, Tempe, USA, Nov. 2020, pp. 1-7.
[7] D. Owerko, F. Gama, and A. Ribeiro, “Optimal power flow using graph neural networks,” in Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2020, pp. 1-5.
[8] F. Fioretto, T. W. K. Mak, and P. V. Hentenryck, “Predicting AC optimal power flows: combining deep learning and Lagrangian dual method,” in Proceedings of 2020 AAAI Conference on Artificial Intelligence, New York, USA, Feb. 2020, pp. 630-637.
[9] M. Chatzos, F. Fioretto, T. W. K. Mak et al. (2020, Jun.). High-fidelity machine learning approximations of large-scale optimal power flow. [Online]. Available: https://arxiv.org/abs/2006.16356
[10] X. Pan, M. Chen, T. Zhao et al. (2020, Jul.). DeepOPF: a feasibility-optimized deep neural network approach for AC optimal power flow problems. [Online]. Available: https://arxiv.org/abs/2007.01002v1
[11] T. Yu, J. Liu, K. W. Chan et al., “Distributed multi-step Q(λ) learning for optimal power flow of large-scale power grids,” International Journal of Electrical Power and Energy Systems, vol. 42, no. 1, pp. 614-620, Nov. 2012.
[12] Z. Yan and Y. Xu, “Real-time optimal power flow: a Lagrangian based deep reinforcement learning approach,” IEEE Transactions on Power Systems, vol. 35, no. 4, pp. 3270-3273, Apr. 2020.
[13] J. Schulman, F. Wolski, P. Dhariwal et al. (2017, Jul.). Proximal policy optimization algorithms. [Online]. Available: https://arxiv.org/abs/1707.06347
[14] CAISO. (2020, Dec.). Today’s California ISO website. [Online]. Available: http://www.caiso.com/TodaysOutlook/Pages/default.aspx
[15] ERCOT. (2020, Dec.). ERCOT protocols & operating guides on reactive testing. [Online]. Available: http://www.ercot.com/content/wcm/key_documents_lists/54515/Reactive_Testing__ERCOT_Protocols_Op._Guides.pdf
[16] V. Francois-Lavet, P. Henderson, R. Islam et al., “An introduction to deep reinforcement learning,” Foundations and Trends in Machine Learning, vol. 11, no. 3-4, pp. 219-354, Dec. 2018.
[17] OpenAI. (2020, Dec.). OpenAI blog: proximal policy optimization. [Online]. Available: https://openai.com/blog/openai-baselines-ppo/
[18] J. Schulman, P. Moritz, S. Levine et al. (2018, Oct.). High-dimensional continuous control using generalized advantage estimation. [Online]. Available: https://arxiv.org/abs/1506.02438v4
[19] J. Duan, D. Shi, R. Diao et al., “Deep-reinforcement-learning-based autonomous voltage control for power grid operations,” IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 814-817, Sept. 2019.
[20] F. Pedregosa, G. Varoquaux, A. Gramfort et al., “Scikit-learn: machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825-2830, Oct. 2011.
[21] G. Serpen and Z. Gao, “Complexity analysis of multilayer perceptron neural network embedded into a wireless sensor network,” Procedia Computer Science, vol. 36, pp. 192-197, Nov. 2014.
[22] K. Fredenslund. (2020, Dec.). Computational complexity of neural networks. [Online]. Available: https://kasperfred.com/series/introduction-to-neuralnetworks/computational-complexity-of-neural-networks
[23] T. H. Cormen, C. E. Leiserson, R. L. Rivest et al., Introduction to Algorithms, 3rd ed., Cambridge: MIT Press, 2009.
[24] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” in Proceedings of 3rd International Conference on Learning Representations (ICLR), San Diego, USA, May 2015, pp. 1-15.
[25] Texas A&M University. (2020, Dec.). Electric grid test case repository. [Online]. Available: https://electricgrids.engr.tamu.edu/electric-grid-test-cases/
[26] PYPOWER. (2020, Dec.). PYPOWER 5.1.4. [Online]. Available: https://pypi.org/project/PYPOWER/
[27] R. D. Zimmerman, C. E. Sanchez, and R. J. Thomas, “MATPOWER: steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12-19, Feb. 2011.
[28] Y. Zhou, B. Zhang, C. Xu et al., “A data-driven method for fast AC optimal power flow solutions via deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1128-1139, Nov. 2020.
[29] Y. Dvorkin, M. Lubin, S. Backhaus et al., “Uncertainty sets for wind power generation,” IEEE Transactions on Power Systems, vol. 31, no. 4, pp. 3326-3327, Jul. 2016.
[30] K. Baker, “Solutions of DC OPF are never AC feasible,” in Proceedings of 12th ACM International Conference on Future Energy Systems, Virtual, Italy, Jun. 2021, pp. 264-268.