Journal of Modern Power Systems and Clean Energy

ISSN 2196-5625 CN 32-1884/TK


Simplified Deep Reinforcement Learning Based Volt-var Control of Topologically Variable Power System

  • Qing Ma (Member, IEEE)
  • Changhong Deng (Member, IEEE)
School of Electrical Engineering and Automation, Wuhan University, Wuhan, China

Updated: 2023-09-20

DOI:10.35833/MPCE.2022.000468


Abstract

The high penetration and uncertainty of distributed energies force the upgrade of volt-var control (VVC) to smooth the voltage and var fluctuations faster. Traditional mathematical or heuristic algorithms are increasingly incompetent for this task because of their slow online calculation speed. Deep reinforcement learning (DRL) has recently been recognized as an effective alternative, as it transfers the computational pressure to offline training and its online calculation time scale reaches milliseconds. However, its slow offline training speed still limits its application to VVC. To overcome this issue, this paper proposes a simplified DRL method that simplifies and improves the training operations in DRL, avoiding invalid explorations and slow reward calculation. Given that the DRL network parameters trained for the original topology are not applicable to new topologies, side-tuning transfer learning (TL) is introduced to reduce the number of parameters that need to be updated in the TL process. Test results based on the IEEE 30-bus and 118-bus systems prove the correctness and rapidity of the proposed method, as well as its strong applicability to large-scale control variables.

I. Introduction

THE proportions of wind, photovoltaic, and other distributed energies in the power system have increased dramatically in recent years. Because of their random output and high-density injection, the voltage and var of the local grid often vary widely within a short time, e.g., the average voltage fluctuation of the 220 kV bus of a wind farm within 10 s can reach 6 kV, and the maximum fluctuation within 2 s can exceed 5 kV. These rapid fluctuations undoubtedly spawn the need to upgrade volt-var control (VVC) [1]-[3].

VVC is essentially a mixed-integer nonlinear optimization problem that coordinates discrete reactive power regulation equipment (such as capacitors and transformer taps) and continuous equipment (such as static var compensators (SVCs), static var generators (SVGs), and the reactive power of generators) to achieve globally optimal operation of the power system [4], [5]. Traditional VVC methods are mainly divided into mathematical algorithms represented by the interior point method (IPM) [6], [7] and heuristic algorithms represented by particle swarm optimization (PSO) [8]-[10]. Mathematical algorithms have a strict theoretical basis and can theoretically converge to the global optimum. However, their modeling process is complicated because all the constraints need to be modeled accurately. The modeling process of heuristic algorithms is simple, but the iterative process requires repeated power flow calculation (PFC), and there is no guarantee that the solution is the global optimum. Moreover, both kinds of algorithms share the common disadvantage of slow online calculation, whose time scale is seconds or even minutes, and which becomes more apparent when the number of system nodes or control devices is large. Therefore, traditional VVC methods clearly cannot smooth the rapid fluctuations of voltage and var for future power systems, especially distribution networks with large amounts of distributed energy access.

To realize real-time response to voltage and var fluctuations, many scholars have recently introduced deep reinforcement learning (DRL), which has been applied in robot control, autopilot, and other complex control fields, into VVC [11]-[19]. Regarding computation time, DRL transfers the online calculation pressure to offline training, and the online calculation for any new scenario only requires simple matrix operations based on well-trained neural networks, with a millisecond time scale. Regarding modeling complexity, DRL is as easy to execute as heuristic algorithms, since it does not need to model all the constraints exactly one by one. Regarding interpretability, DRL has an excellent theoretical basis similar to mathematical algorithms, and the updating gradients of the neural network parameters are obtained by strict reverse derivation [20].

DRL algorithms mainly fall into two categories, i.e., value-based and policy-based ones. The “actor-critic” type essentially belongs to the policy-based DRL algorithms, realizing the direct mapping from state to action by establishing an actor network. Meanwhile, it also absorbs the advantage of value-based DRL algorithms, which evaluate the action value by establishing a critic network; this brings single-step updates to replace the iteration updates used in early policy-based algorithms and greatly improves the training efficiency. Therefore, the existing research on the application of DRL to VVC mainly adopts “actor-critic” type DRL algorithms such as deep deterministic policy gradient (DDPG) [11]-[14], proximal policy optimization (PPO) [15], twin delayed deep deterministic policy gradient (TD3) [16], and soft actor-critic (SAC) [17]-[19]. They all adopt the trained actor network to directly establish an end-to-end mapping between the power system state and the control strategy of the reactive power equipment, while the critic network is used to judge the quality of the control strategy.

However, the “actor-critic” type also brings certain defects while combining the advantages of the above two types of DRL algorithms [21]-[23]. Since the premise of the actor network generating excellent actions is that the critic network can make accurate judgments on the values of different actions, if the critic network itself is difficult to train, the actor network will face even more difficulty in converging. In fact, the training goal of the critic network is to satisfy the Bellman equation of the action value in all scenarios. Compared with conventional end-to-end supervised training, the training difficulty is significantly increased, requiring many more training samples and iterations. Meanwhile, a single training step of the critic network involves calculating the value of the current action, the reward, and the value of the following action, so the computational complexity also increases compared with supervised training of the same network scale. Therefore, when “actor-critic” type DRL algorithms are applied to VVC, they still face arduous and time-consuming training, and even training failure, which becomes more apparent when the number of system nodes or control devices is large [24]-[26].

In addition, the existing research on applying DRL to VVC only considers power systems with fixed topology. When the topology changes, the actor and critic networks trained for the original topology are no longer applicable. However, the system topology changes frequently in actual operation due to equipment failures, load transfer, and routine maintenance. If the network parameters suitable for the new topology can only be obtained by repeating all the training operations of DRL, the timeliness of applying DRL to VVC will be significantly reduced.

To overcome the above shortcomings of DRL applied to VVC, this paper proposes a simplified DRL-based VVC of topologically variable power system. Compared with existing literature, the main contributions of this paper are as follows.

1) Simplification of critic network training. The Agent and Environment (power system) are set to interact only once in each iteration, so that the reward function directly serves as the action value of the Agent. The critic network training is then simplified to fitting the nonlinear relationship between the power system state and the node voltages in a supervised manner, and traditional PFC can be replaced by the simple forward calculation of the critic network.

2) Simplification of actor network training. As training a good actor network depends heavily on the judgment quality of the critic network, the actor network training is set to start only after the critic network training is complete. The large number of invalid explorations in the early training stage needed to form a good critic network can thus be avoided, and the training efficiency of the actor network is significantly improved by the guidance of the well-trained critic network from the start of training.

3) Fast training of DRL-based VVC for topologically variable power system. Side-tuning transfer learning (TL) is adopted to quickly obtain the network parameters suitable for a new topology with only minor training of a newly established small network. Compared with conventional fine-tuning TL, the TL rate can be greatly improved.

The remainder of this paper is organized as follows. Section II is mainly the formulation of DRL-based VVC. Section III proposes the simplified DRL applied to VVC. Side-tuning TL-based VVC of topologically variable power system is elaborated in Section IV. Section V shows the general flowchart of the proposed method. The results of numerical tests are demonstrated in Section VI. Section VII states the conclusions.

II. Formulation of DRL-based VVC

A. Traditional VVC Mathematical Model

To ensure the safety of system operation, traditional VVC usually selects the voltage deviation as the optimization index. Taking node voltage limit violations as the penalty function, the VVC mathematical model is commonly constructed as:

$$\begin{aligned}
\min\; & F=\sum_{i=1}^{n}\left(V_i-V_{i,\mathrm{tar}}\right)^2+\lambda V^* \\
& V^*=\sum_{i=1}^{n}\left(\max\left(V_i-V_{i,\max},0\right)+\max\left(V_{i,\min}-V_i,0\right)\right) \\
\mathrm{s.t.}\;\; & P_{Gi}-P_{Li}-V_i\sum_{j=1}^{n}V_j\left(G_{ij}\cos\delta_{ij}+B_{ij}\sin\delta_{ij}\right)=0 \\
& Q_{Gi}+Q_{Ci}-Q_{Li}-V_i\sum_{j=1}^{n}V_j\left(G_{ij}\sin\delta_{ij}-B_{ij}\cos\delta_{ij}\right)=0 \\
& C_{Qi,\min}\le C_{Qi}\le C_{Qi,\max}
\end{aligned} \tag{1}$$

where F is the objective function; n is the number of system nodes; Vi and Vi,tar are the voltage and its target value, respectively; V* is the off-limit penalty function of voltage; λ is the corresponding coefficient; Vi,max and Vi,min are the upper and lower voltage limits, respectively; PGi and QGi are the active and reactive power outputs of generators, respectively; QCi is the reactive power compensation; PLi and QLi are the active and reactive loads, respectively; Gij and Bij are the conductance and susceptance of the line, respectively; δij is the phase angle difference between the head and tail nodes; and CQi,max and CQi,min are the upper and lower regulation limits of reactive power equipment CQi, respectively.
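
As a concrete illustration of (1), the following sketch evaluates the objective F for one operating point with the PYPOWER toolkit used later in Section VI; the target voltage, voltage limits, and penalty weight λ are assumed values chosen only for demonstration, not taken from the paper.

```python
import numpy as np
from pypower.api import case30, runpf, ppoption
from pypower.idx_bus import VM

# Assumed settings for illustration only
V_TAR, V_MAX, V_MIN, LAMBDA_V = 1.0, 1.05, 0.95, 100.0

def vvc_objective(ppc):
    """Run a power flow and evaluate the objective F of (1) for this case."""
    results, success = runpf(ppc, ppoption(VERBOSE=0, OUT_ALL=0))
    if not success:
        return np.inf                          # infeasible operating point
    v = results['bus'][:, VM]                  # per-unit node voltage magnitudes
    deviation = np.sum((v - V_TAR) ** 2)       # voltage-deviation term
    off_limit = np.sum(np.maximum(v - V_MAX, 0.0)
                       + np.maximum(V_MIN - v, 0.0))   # off-limit penalty V*
    return deviation + LAMBDA_V * off_limit

print(vvc_objective(case30()))                 # IEEE 30-bus base case
```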

B. Combination of DRL and VVC

The main concepts involved in DRL include Agent, Environment, Action, State, and Reward. In this paper, Action, State, and Reward are abbreviated as A, S, and R, respectively. The goal of DRL is to train a policy π that directly establishes the mapping between S and A and maximizes the total expected discounted reward of the Agent, i.e., $\pi^*(A_t|S_t)=\arg\max_\pi J(\pi)$ with $J(\pi)=E\left[\sum_{t=0}^{T}\gamma^t R_t\right]$, where $E[\cdot]$ is the mathematical expectation; T is the number of interactions between the Agent and Environment; and γ is the discount factor of R. The policy π is updated based on the data samples stored during the iterations. Finally, the well-trained Agent can achieve excellent control strategies via π with the minimum number of interaction steps in the face of any entirely new S.

As shown in Fig. 1, when DRL is applied to VVC, the above five concepts are set explicitly as follows.

Fig. 1  Concise schematic diagram of DRL.

1) Agent and Environment: the system operator or control program is set as Agent, and the power system interacting with Agent is set as Environment.

2) State: the set of power system real-time operating state parameters is set as S, which usually contains the active and reactive loads, the active power output of generators, and the operating state of all reactive power equipment.

3) Action: the control strategy of reactive power equipment generated for S is set as A.

4) Reward: the function that characterizes the quality of A is set as R. In fact, the commonly-used R in DRL is the same as the objective function F of traditional VVC mathematical model.

C. Elaboration of Actor-critic-type DRL Algorithm

The implementation object of the simplification strategy proposed in this paper is the actor-critic-type DRL algorithm, whose essence is to build a critic network to evaluate the different A generated by the actor network, and to continuously update the parameters of the two networks based on the data samples obtained from the continuous interactions between the Agent and Environment. Finally, the critic network can generate the most accurate value of A (abbreviated as Q), and the actor network can generate the best A with the highest Q for different scenarios.

As shown in Fig. 2, DDPG [27], the most common of the actor-critic-type DRL algorithms, is taken as an instance to describe the interaction and update process in detail when applied to VVC. DDPG includes four neural networks: the actor network and critic network are used respectively to generate the control strategy of the reactive power equipment and to judge the quality of that strategy, while the actor-target network and critic-target network are auxiliary networks used to update the parameters of the actor and critic networks, respectively. The four networks are abbreviated as μ, Q, μ*, and Q*, and the corresponding network parameters are θ, ω, θ*, and ω*. In Fig. 2, a and q are the quantities used in training the actor network, where a is the action obtained by inputting the S of the training samples into the actor network, and q is the value evaluated by inputting S together with a into the critic network; the purple lines represent the training of the actor network, while the yellow lines represent the training of the critic network.

Fig. 2  Interactive update of DDPG.

1) Generation of training samples. As shown in (2), in each iteration, S randomly generated by Environment (power system) is input into actor network to generate A (reactive power equipment control strategy). After putting the noise-added A back into the Environment, a new S*, R, and termination flag Done are obtained. Then, the data sample [S, A, R, S*, Done] formed in this interaction is stored in the replay buffer D.

$$\begin{aligned}
T_S &=\left[S,A,R,S^*,\mathrm{Done}\right] \\
A &=\mathcal{N}\!\left(\mu(S|\theta),v_a\right) \\
R &=\mathrm{PF}(S,A)
\end{aligned} \tag{2}$$

where TS is the generated training sample; N is the Gaussian distribution; μ(·) is the expectation; va is the variance; and PF(·) represents that R is commonly calculated by the PFC of power system.
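
A minimal sketch of this interaction step is given below; `actor`, `apply_action_and_run_pf`, and `replay_buffer` are hypothetical helpers standing in for the actor network forward pass, the PFC-based environment step, and the buffer D, and the noise variance is an assumed value.

```python
import numpy as np

def generate_sample(S, actor, apply_action_and_run_pf, replay_buffer, va=0.05):
    """One interaction of (2): noisy action, PFC-based reward, sample stored in D."""
    a = actor(S)                                             # mu(S | theta)
    A = a + np.random.normal(0.0, np.sqrt(va), np.shape(a))  # exploration noise
    S_star, R, done = apply_action_and_run_pf(S, A)          # new state, reward, Done
    replay_buffer.append((S, A, R, S_star, done))            # store [S, A, R, S*, Done]
    return S_star, done
```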

2) Training of critic network. The training goal of the critic network is to satisfy the Bellman equation of Q shown in (3). That is, the value Q of current action A judged by critic network equals the sum of R and the value Q of the following action A*. Therefore, the training method takes the deviation between the actual Q and the estimated Q, called TD-error, as the loss function to train the critic network parameters based on the training samples randomly chosen from D.

$$Q(S,A|\omega)=R+\gamma Q^*(S^*,A^*|\omega^*) \tag{3}$$

$$\begin{aligned}
J(\omega) &=\frac{1}{2M}\sum_{i=1}^{M}\left[Q(S_i,A_i|\omega)-\left(R_i+\gamma Q^*(S_i^*,A_i^*|\omega^*)\right)\right]^2 \\
\omega &=\omega-\alpha_\omega\nabla_\omega J(\omega)
\end{aligned} \tag{4}$$

where M is the number of chosen training samples; γ is the discount factor used to estimate Q; and J(ω) and αω are the loss function and learning rate of critic network, respectively.
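
A sketch of one critic update according to (3) and (4) is shown below, written in PyTorch as an assumed framework (the paper does not prescribe one); `critic`, `critic_target`, and `actor_target` are module objects, and the batch tensors come from the replay buffer D.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, critic_target, actor_target, optimizer, batch, gamma=0.99):
    """Minimize the TD-error of (4) on a batch of M samples."""
    S, A, R, S_star, done = batch
    with torch.no_grad():
        A_star = actor_target(S_star)                        # following action A*
        q_target = R + gamma * (1.0 - done) * critic_target(S_star, A_star)
    q = critic(S, A)
    loss = 0.5 * F.mse_loss(q, q_target)                     # J(omega) in (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                         # omega update
    return loss.item()
```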

3) Training of actor network. As the training goal of actor network is to generate A with the highest Q judged by critic network for all scenarios, Q is directly used as the loss function to train actor network.

$$\begin{aligned}
J(\theta) &=-\frac{1}{M}\sum_{i=1}^{M}Q\left(S_i,\mu(S_i|\theta)|\omega\right) \\
\theta &=\theta-\alpha_\theta\nabla_\theta J(\theta)
\end{aligned} \tag{5}$$

where J(θ) and αθ are the loss function and learning rate of actor network, respectively.
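
Continuing the same PyTorch sketch, the actor update of (5) simply maximizes the critic's evaluation of the actions proposed for the sampled states (equivalently, minimizes its negative mean).

```python
def actor_update(actor, critic, optimizer, S):
    """Gradient step on J(theta) of (5); optimizer holds only the actor parameters."""
    loss = -critic(S, actor(S)).mean()   # -(1/M) * sum_i Q(S_i, mu(S_i | theta))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # theta update
    return loss.item()
```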

III. Simplified DRL Applied to VVC

For the “actor-critic” type DRL algorithms, the premise that the actor network can achieve good control effects is that the critic network can make accurate judgments on the value Q of different A. Since traditional DRL considers the impact of the current A on the future power system, the value Q of the current A judged by the critic network must satisfy the Bellman equation of Q, which is an equality constraint relating the current and subsequent A. Compared with end-to-end supervised training, this kind of training undoubtedly increases the training difficulty and consumes more training samples. In addition, when DRL is applied to VVC, the value R contained in each sample is obtained by PFC without exception in the existing literature. However, accurate PFC methods such as the Newton-Raphson method take considerable time for each calculation, so the overall training time of DRL increases significantly. To sum up, the key to improving the training speed of DRL-based VVC lies in adopting a simpler and faster way to train a critic network that can accurately judge the quality of different A.

The original intention of applying DRL to VVC is to replace the iterative calculation of traditional methods with the single swift calculation of DRL, so as to rapidly respond to voltage and var fluctuations caused by the random output of distributed energies. However, unlike the application of DRL in fixed Environments such as Go and computer games, when DRL is applied to VVC, the variables in S other than the reactive power equipment are also uncertain in the interactions between the Agent and Environment, such as the active power output of distributed energies, the load size, and even the system topology. Therefore, this paper holds that when DRL is applied to VVC, it is not necessary to consider the impact of the current A generated by the actor network on the future power system; the critic network only needs to evaluate the control effect of A on the current scenario.

Based on the above analysis, this paper proposes a simplified DRL method based on actor-critic architecture for VVC, which includes the following three core ideas.

1) To force the critic network to focus entirely on the control effect of A in the current scenario, the multiple interactions between the Agent and Environment in each iteration are simplified into a single interaction. The action value Q generated by the critic network is reduced to the R obtained from this single interaction, and the Bellman equation of Q is directly reduced to Q(S,A|ω)=R.

2) The method for calculating R is changed from traditional PFC to the forward calculation of the critic network. The critic network training is thereby simplified to the most common supervised training, whose input variables are the state parameters of the power system and whose output variables are the voltages of each node.

3) Given the great simplification of the critic network training, and since the quality of actor network training rests on the critic network making accurate judgments on the value Q of different A, this paper abandons the parallel training of the two networks. The critic network training is completed first through supervised training; the actor network is then trained much more quickly with the help of the well-trained critic network.

Based on the above core ideas, the training process of the simplified DRL-based VVC is shown in Fig. 3, where the purple lines represent the training of the critic network and the yellow lines represent the training of the actor network. A single iteration mainly includes the following parts.

Fig. 3  Update of simplified DRL-based VVC.

1) Training of critic network. Different operating scenarios are formed by randomly adjusting the node loads within 0-1.2 times their normal level, the active power output of the generators within 0-1 times the rated value, and the instructions of the reactive power equipment within their upper and lower limits; PFC is then performed to obtain the node voltages. In this way, massive training samples can be obtained for the supervised training of the critic network.

$$\begin{aligned}
J(\omega) &=\frac{1}{2M}\sum_{i=1}^{M}\left(v(S_i,A_i|\omega)-V(S_i,A_i)\right)^2 \\
\omega &=\omega-\alpha_\omega\nabla_\omega J(\omega)
\end{aligned} \tag{6}$$

where v and V are the node voltage predicted by critic network and the label voltage obtained by PFC, respectively.
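
The supervised critic training of (6) is just a voltage-regression step; a minimal PyTorch sketch (assumed framework) is given below, with the labels V obtained from PFC as described above.

```python
import torch
import torch.nn.functional as F

def supervised_critic_step(critic, optimizer, S, A, V_label):
    """One supervised step of (6): fit the node voltages predicted from (S, A)."""
    v_pred = critic(S, A)                      # v(S_i, A_i | omega)
    loss = 0.5 * F.mse_loss(v_pred, V_label)   # J(omega) in (6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```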

2) Training of actor network. Unlike traditional DRL, which uses the value Q evaluated by the critic network as the loss function for training the actor network, this paper trains the actor network parameters strictly following the chain derivative rule based on the well-trained critic network.

J(θ)=12Mi=1MRiViViAiAiθθ=θ-αθJ(θ) (7)
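
A sketch of this actor update is shown below, again in PyTorch as an assumed framework: the actions are passed through the frozen, well-trained critic to predict the node voltages, R is expressed directly in terms of those voltages using the objective in (1), and automatic differentiation applies the chain rule of (7); the target voltage, limits, and penalty weight are assumed values.

```python
import torch

def simplified_actor_step(actor, critic, optimizer, S,
                          v_tar=1.0, v_max=1.05, v_min=0.95, lam=100.0):
    """Chain-rule update of (7): dR/dV * dV/dA * dA/dtheta via autograd."""
    for p in critic.parameters():
        p.requires_grad_(False)                  # critic is fixed; only theta updates
    A = actor(S)                                 # A = mu(S | theta)
    V = critic(S, A)                             # predicted node voltages, shape [M, n]
    loss = (((V - v_tar) ** 2).sum(dim=1)
            + lam * (torch.clamp(V - v_max, min=0)
                     + torch.clamp(v_min - V, min=0)).sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()                              # gradients flow through the frozen critic
    optimizer.step()
    return loss.item()
```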

IV. Side-tuning TL-based VVC of Topologically Variable Power System

TL is introduced in this paper to solve the problem of quickly updating the parameters of the DRL networks when the system topology changes. TL [28], [29] refers to applying the knowledge learned in old tasks to new ones based on the similarity between the old and new tasks. As TL can greatly improve the training speed of deep neural networks for new tasks, it has been widely used in image classification, advertising recommendation, and other fields that typically adopt very large networks. TL can also be applied to the VVC of topologically variable power systems, because when the topology changes, only a small number of devices such as lines or transformers are put into or taken out of operation while most of the topology remains the same. Therefore, the knowledge contained in the DRL networks trained for the original topology has important guiding significance for the new topology, and DRL networks suitable for the new topology can be quickly trained based on TL.

The most common TL method is fine-tuning. As shown in (8), its core idea is to take the network parameters trained by the original task directly as the initial values of the new task, and then the network parameters suitable for the new task can be quickly trained through fewer training samples and iterations.

$$\begin{aligned}
TN &=\arg\min\sum_{i=1}^{M_t}L\left(x_{t,i},TN,y_{t,i}\right) \\
TN\big|_{I_t=0} &=BN
\end{aligned} \tag{8}$$

where TN and BN are the networks of the new and original tasks, respectively; Mt is the number of training samples; and It is the current training iteration.
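
In a deep learning framework this initialization is a one-line parameter copy; the PyTorch sketch below (assumed framework, with TN and BN sharing the same architecture and an assumed learning rate) makes every parameter of TN trainable afterwards.

```python
import torch

def init_fine_tuning(tn, bn, lr=2e-3):
    """Fine-tuning per (8): start TN from the trained BN, then update all parameters."""
    tn.load_state_dict(bn.state_dict())                  # TN |_{I_t = 0} = BN
    return torch.optim.Adam(tn.parameters(), lr=lr)      # every parameter stays trainable
```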

However, when fine-tuning is applied to DRL-based VVC of large-scale power systems, the number of DRL network parameters increases rapidly with the number of system nodes, and fine-tuning updates all the parameters in each iteration, so training for the new topology is still slow.

To this end, this paper introduces another TL method called side-tuning [30]. As shown in (9) and Fig. 4(b), its core idea is that the output of the new task is obtained as the weighted sum of the output of BN and that of a newly established network with a much smaller parameter volume, called the side network SN. Because the old and new tasks are similar and BN trained on the old task provides the core cognition or perception, side-tuning is less prone to overfitting. In the TL process, only the parameters of SN are updated while those of BN remain unchanged. Compared with fine-tuning, the number of parameters to update in each iteration of side-tuning is greatly reduced, and the TL efficiency can be significantly improved.

$$TN(x_t)=\eta\,BN(x_t)+(1-\eta)\,SN(x_t) \tag{9}$$

Fig. 4  Comparison of fine-tuning and side-tuning. (a) Fine-tuning. (b) Side-tuning.

where η is the weighting factor of side-tuning.
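
The combination in (9) can be wrapped as a small module; the PyTorch sketch below (assumed framework) freezes BN and leaves only the side network SN trainable.

```python
import torch
import torch.nn as nn

class SideTuned(nn.Module):
    """Side-tuning head of (9): output = eta * BN(x) + (1 - eta) * SN(x)."""
    def __init__(self, bn: nn.Module, sn: nn.Module, eta: float = 0.5):
        super().__init__()
        self.bn, self.sn, self.eta = bn, sn, eta
        for p in self.bn.parameters():
            p.requires_grad_(False)              # BN parameters remain unchanged

    def forward(self, x):
        return self.eta * self.bn(x) + (1.0 - self.eta) * self.sn(x)
```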

As shown in Fig. 5, when the side-tuning is applied to the TL of the simplified DRL-based VVC, the updating gradients of SN for critic network and actor network can be computed easily by following the chain derivative rule:

$$\begin{aligned}
J(\omega_{BN+SN}) &=\frac{1}{2M}\sum_{i=1}^{M}\left[v(S_i,A_i|\omega_{BN+SN})-V(S_i,A_i)\right]^2 \\
\omega_{SN} &=\omega_{SN}-\alpha_\omega\frac{\partial J(\omega_{BN+SN})}{\partial\omega_{SN}}
\end{aligned} \tag{10}$$

$$\begin{aligned}
\nabla_{\theta_{SN}}J(\theta_{SN}) &=\frac{1}{2M}\sum_{i=1}^{M}\frac{\partial R_i}{\partial V_i}\left(\eta\frac{\partial V_i}{\partial A_i}\bigg|_{BN}+(1-\eta)\frac{\partial V_i}{\partial A_i}\bigg|_{SN}\right)(1-\eta)\frac{\partial A_i}{\partial\theta_{SN}} \\
\theta_{SN} &=\theta_{SN}-\alpha_\theta\nabla_{\theta_{SN}}J(\theta_{SN})
\end{aligned} \tag{11}$$

Fig. 5  Update of side-tuning TL of simplified DRL-based VVC.

where the subscripts BN and SN correspond to the network of original task and newly-established network, respectively. In Fig. 5, the purple lines represent the training of critic network, while the yellow lines represent the training of actor network.
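
The practical consequence of (10) and (11) is that only the SN parameters are handed to the optimizer; a sketch for the critic side is shown below, where `side_critic` is an instance of the illustrative `SideTuned` wrapper above and `optimizer_sn` was built over `side_critic.sn.parameters()` only.

```python
import torch.nn.functional as F

def side_tuning_critic_step(side_critic, optimizer_sn, S_A, V_label):
    """One TL step of (10): gradients reach only the small side network SN."""
    v_pred = side_critic(S_A)                  # combined BN + SN voltage prediction
    loss = 0.5 * F.mse_loss(v_pred, V_label)   # J(omega_{BN+SN}) in (10)
    optimizer_sn.zero_grad()
    loss.backward()                            # BN is frozen, so only SN accumulates grads
    optimizer_sn.step()
    return loss.item()
```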

V. General Flowchart of Proposed Method

The general flowchart of the proposed method is shown in Fig. 6: the left part is the simplified DRL applied to VVC, while the right part is the side-tuning TL-based VVC of topologically variable power system. All the improvements for DRL and TL applied to VVC are easy to execute.

Fig. 6  General flowchart of proposed method.

VI. Numerical Tests

The rapidity and correctness of the proposed method are verified using the IEEE 30-bus and 118-bus systems, whose detailed parameters are obtained from MATPOWER in MATLAB. The control objects of VVC mainly include the generators and transformer taps. The number of control variables for the IEEE 30-bus system is therefore 9, while that for the 118-bus system reaches 64, a relatively large number in the existing literature on DRL applied to VVC. The target of VVC is to make the voltage of all nodes close to the target value of 1 p.u. All the programs are written in Python 3.7.5, and the PFC involved is performed with the PYPOWER toolkit.

All the tests are performed on a Windows PC equipped with Intel Core i5-12500H CPU @ 2.5 GHz and 16 GB RAM.

The methods used for comparison are listed in Table I. All the methods share the same parameters in the computing parts they have in common, as listed in Table II. Meanwhile, all the methods share the same collection of system operating scenarios for both training and testing.

TABLE I  Description of Different Methods

| Method | Description |
|--------|-------------|
| 1 | Simplified DRL (proposed method in this paper) |
| 2 | SAC (current best DRL method) |
| 3 | IPM (mathematical algorithm) |
| 4 | PSO (heuristic algorithm) |
| 5 | Side-tuning TL |
| 6 | Fine-tuning TL |

TABLE II  Basic Parameter Settings of Algorithms

| Method | Parameter | 30-bus system | 118-bus system |
|--------|-----------|---------------|----------------|
| Simplified DRL/SAC | Actor network: number of layers | 4 | 5 |
| | Actor network: nodes per hidden layer | 128 | 512 |
| | Actor network: learning rate | 0.004 | 0.004 |
| | Critic network: number of layers | 4 | 5 |
| | Critic network: nodes per hidden layer | 128 | 512 |
| | Critic network: learning rate | 0.004 | 0.004 |
| | Iteration number | 200 | 200 |
| | Number of training samples | 5000 | 10000 |
| | Number of test samples | 500 | 500 |
| IPM | Central coefficient | 0.1 | 0.1 |
| | Convergence precision | 10⁻⁶ | 10⁻⁶ |
| PSO | Number of particles | 30 | 50 |
| | Maximum speed coefficient | 0.05 | 0.05 |
| | Convergence precision | 10⁻⁶ | 10⁻⁶ |
| Side-tuning TL | SN of actor: number of layers | 3 | 4 |
| | SN of actor: nodes per hidden layer | 32 | 128 |
| | SN of actor: learning rate | 0.002 | 0.002 |
| | SN of critic: number of layers | 3 | 4 |
| | SN of critic: nodes per hidden layer | 32 | 128 |
| | SN of critic: learning rate | 0.002 | 0.002 |
| | Number of training samples | 1000 | 2000 |

A. Validation of Simplified DRL-based VVC

Figure 7 uses boxplots to compare the control effects in 500 different test scenarios, where the results of methods 1 and 2 are computed by their well-trained actor networks and those of methods 3 and 4 by their iterative optimization procedures. The index used for comparison is the average node voltage deviation ΔVav formulated in (12). In each boxplot of Fig. 7, the top and bottom solid horizontal lines represent the maximum and minimum ΔVav, respectively, while the middle dotted horizontal line represents the average ΔVav. The specific ΔVav data and the computing time of each method are presented in Table III. Figure 8 compares the specific ΔVav of the different methods for scenarios 1-100 in the form of plotted lines. Figures 9 and 10 compare the training effect and training speed of methods 1 and 2 over five runs with different random seeds. The index used for comparing the training effect is the average ΔVav.

$$\Delta V_{\mathrm{av}}=\frac{1}{n}\sum_{i=1}^{n}\left|V_i-1\right| \tag{12}$$
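
For reference, (12) amounts to the following one-line computation over the per-unit node voltages.

```python
import numpy as np

def delta_v_av(v):
    """Average absolute deviation of node voltages from the 1 p.u. target, as in (12)."""
    return np.mean(np.abs(np.asarray(v) - 1.0))
```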

Fig. 7  General control effect comparison of methods 1-4.

TABLE III  Statistics of ΔVav and Computing Time of Four Methods

| Method | Average ΔVav | Maximum ΔVav | Minimum ΔVav | Computing time (s) |
|--------|--------------|--------------|--------------|--------------------|
| 1 | 3.43×10⁻³ | 4.78×10⁻³ | 2.07×10⁻³ | 3.8×10⁻⁴ |
| 2 | 4.40×10⁻³ | 6.19×10⁻³ | 2.16×10⁻³ | 3.8×10⁻⁴ |
| 3 | 3.46×10⁻³ | 4.81×10⁻³ | 2.04×10⁻³ | 9.0×10⁻¹ |
| 4 | 3.47×10⁻³ | 4.86×10⁻³ | 2.10×10⁻³ | 4.1×10⁰ |

Fig. 8  Detailed control effect comparison of methods 1-4 in scenarios 1-100.

Fig. 9  Training effect comparison of methods 1 and 2.

Fig. 10  Training speed comparison of methods 1 and 2.

Firstly, the online control effect and calculation speed of the four methods are compared. As can be observed in Fig. 7 and Table III, the control effect of method 1 over the 500 different scenarios is very similar to that of the mathematical algorithm represented by method 3 and the heuristic algorithm represented by method 4, and much better than that of the traditional DRL algorithm represented by method 2.

This demonstrates that the simplification strategy for the actor-critic-type DRL algorithm proposed in this paper works well, and that the gradient descent optimization based on the backward derivation of R for training the actor network allows method 1 to achieve the same optimization accuracy as mathematical or heuristic algorithms. These conclusions are further illustrated in Fig. 8, where the curves of methods 1, 3, and 4 are glued together over the 100 displayed test scenarios, while the curve of method 2 is detached from the other three and lies above them with a larger voltage deviation. In terms of online calculation speed, as the forward calculation of the actor network is only a straightforward matrix calculation, the calculation speeds of methods 1 and 2 are much faster than those of methods 3 and 4, which require repeated iterative calculation.

Secondly, the offline training effects of methods 1 and 2 are compared. As can be observed in Fig. 9, since the actor network structure and initialization parameters of methods 1 and 2 are exactly the same, the average ΔVav of the two methods coincides before the training starts. During training, the critic network of method 2 needs a large number of random exploration samples to gradually satisfy the Bellman equation of Q, and the actor network needs a well-trained critic network to obtain an excellent voltage control effect, so the average ΔVav of method 2 shows a steep increase followed by a slow decrease; the blue shadow denotes the envelope formed by the results of the multiple runs. In the case of method 1, the critic network has already been trained in advance by supervised training, so it can combine the reward function to provide the actor network with an accurate evaluation of the reactive power equipment control strategy from the start of training, and the gradient descent based on strict chain derivation ensures that each update of the actor network moves toward the optimal direction. The average ΔVav of method 1 therefore converges in only about ten iterations and decreases monotonically throughout the training process.

Thirdly, the training speeds of methods 1 and 2 are compared. As can be observed in Fig. 10, because traditional DRL performs multiple interactions between the Agent and Environment in a single iteration, each iteration at the beginning always reaches the upper limit of the interaction number while the critic and actor networks are still imperfect. Meanwhile, traditional DRL has to adopt a traditional PFC method, such as the Newton-Raphson method, to obtain R in each interaction. As a result, method 2 consumes a large amount of computing time in the initial 50 iterations, accounting for more than 4/5 of its total training time. In the case of method 1, owing to the single interaction per iteration and the replacement of traditional PFC by the forward calculation of the well-trained critic network, the training time is greatly reduced compared with method 2, and the training speed is improved by about 2.3 times even though some time is spent training the critic network in advance.

B. Validation of Side-tuning TL-based VVC

Based on methods 1, 5, and 6, the test of the side-tuning TL-based VVC is carried out. In the operating scenario before the topology change, all lines in the system are in operation. In the scenario after the topology change, the line between nodes 10 and 21 and the line between nodes 8 and 28 of the IEEE 30-bus system are disconnected.

Figures 11 and 12 present the training effect and training speed of methods 1, 5, and 6 during the training process. The index used for comparison of the training effect is also the average ΔVav.

Fig. 11  Training effect comparison of methods 1, 5, and 6.

Fig. 12  Training speed comparison of methods 1, 5, and 6.

As can be observed in Fig. 11, before the training starts, the method with the largest average ΔVav is method 1, whose parameters are all randomly initialized; the next is method 5, which inherits the parameters of the original task and randomly initializes the new SN; and the smallest is method 6, which directly uses the parameters of the original task as initialization. This shows that the network parameters trained for the original topology still have strong guiding significance for the new topology even though they cannot be applied directly. After the training starts, compared with method 1, which requires almost 25 iterations to converge to the optimal value, methods 5 and 6 both converge within only five iterations, with overlapping curves, which proves that side-tuning TL can achieve the same fast training effect as fine-tuning TL and validates the effectiveness of side-tuning TL applied to VVC.

As shown in Fig. 12, the training speed obtained with the side-tuning TL of method 5 is clearly the fastest, nearly five times faster than that of method 1 and nearly two times faster than that of method 6. This comes from the following two reasons.

1) By adopting TL, the number of samples required for critic network training of the new topology is greatly reduced, so the time spent generating samples with traditional PFC drops substantially.

2) By adopting side-tuning TL, the actor and critic networks not only enjoy the good guidance of the original topology's network parameters, but the parameters to be updated also involve only the small-volume SN, so the calculation amount in each iteration is significantly reduced and the training speed of method 5 is improved by about two times compared with method 6.

Based on the above simulation results and analysis, once method 1 has been used to complete the training of the actor and critic networks for a certain topology, the network parameters suitable for a new topology can be quickly obtained with method 5 based on side-tuning TL, with the training speed improved by about ten times compared with traditional DRL trained from scratch.

C. Test on IEEE 118-bus System

To verify the generality of the proposed methods and their strong applicability to large-scale control variables, simulations are carried out on the IEEE 118-bus system, which contains 64 control variables. The performance of the various methods is shown in Table IV. The results demonstrate that the proposed method drastically improves the training speed of DRL applied to VVC while effectively reducing the voltage deviation of the power system, which is consistent with the conclusions on the IEEE 30-bus system.

TABLE IV  Statistics of Calculation Results on IEEE 118-bus System

| Validation | Method | Average ΔVav | Computing time (s) | Training time (s) |
|------------|--------|--------------|--------------------|-------------------|
| Validation of simplified DRL | 1 | 3.07×10⁻³ | 6.40×10⁻³ | 597.1 |
| | 2 | 3.98×10⁻³ | 6.40×10⁻³ | 2320.6 |
| | 3 | 3.08×10⁻³ | 4.20×10⁰ | |
| | 4 | 3.12×10⁻³ | 4.05×10¹ | |
| Validation of side-tuning TL | 1 | 3.57×10⁻³ | 6.40×10⁻³ | 589.0 |
| | 5 | 3.57×10⁻³ | 6.70×10⁻³ | 183.4 |
| | 6 | 3.57×10⁻³ | 6.40×10⁻³ | 306.4 |

VII. Conclusion

This paper presents a simplified DRL-based VVC, which greatly simplifies the interaction and training process in DRL and forces the Agent to obtain the control strategy that minimizes the voltage deviation through a single interaction when facing an entirely new operating scenario. The test results prove that the proposed method can not only achieve the same calculation accuracy as traditional mathematical methods, but also significantly improve the training speed compared with traditional DRL.

This paper also introduces side-tuning TL into the DRL-based VVC of topologically variable power system to reduce the number of parameters that need to be updated when the system topology changes, and provides the mathematical derivation for applying side-tuning TL to the simplified DRL. The test results show that the proposed method achieves a faster training speed than traditional TL, greatly improving the timeliness of the simplified DRL applied to VVC.

References

[1] D. K. Molzahn, F. Dörfler, H. Sandberg et al., “A survey of distributed optimization and control algorithms for electric power systems,” IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2941-2962, Nov. 2017.
[2] M. H. J. Bollen, R. Das, S. Djokic et al., “Power quality concerns in implementing smart distribution-grid applications,” IEEE Transactions on Smart Grid, vol. 8, no. 1, pp. 391-399, Jan. 2017.
[3] A. Bedawy, N. Yorino, K. Mahmoud et al., “Optimal voltage control strategy for voltage regulators in active unbalanced distribution systems using multi-agents,” IEEE Transactions on Power Systems, vol. 35, no. 2, pp. 1023-1035, Mar. 2020.
[4] H. Ahmadi, J. R. Martí, and H. W. Dommel, “A framework for volt-var optimization in distribution systems,” IEEE Transactions on Smart Grid, vol. 6, no. 3, pp. 1473-1483, May 2015.
[5] R. A. Jabr and I. Džafić, “Penalty-based volt/var optimization in complex coordinates,” IEEE Transactions on Power Systems, vol. 37, no. 3, pp. 2432-2440, May 2022.
[6] M. B. Liu, C. A. Canizares, and W. Huang, “Voltage and var control in distribution systems with limited switching operations,” IEEE Transactions on Power Systems, vol. 24, no. 2, pp. 889-899, May 2009.
[7] H.-Y. Su and T.-Y. Liu, “Enhanced worst-case design for robust secondary voltage control using maximum likelihood approach,” IEEE Transactions on Power Systems, vol. 33, no. 6, pp. 7324-7326, Nov. 2018.
[8] Y.-Y. Hong, F.-J. Lin, Y.-C. Lin et al., “Chaotic PSO-based var control considering renewables using fast probabilistic power flow,” IEEE Transactions on Power Delivery, vol. 29, no. 4, pp. 1666-1674, Aug. 2014.
[9] Y. Malachi and S. Singer, “A genetic algorithm for the corrective control of voltage and reactive power,” IEEE Transactions on Power Systems, vol. 21, no. 1, pp. 295-300, Feb. 2006.
[10] K. Mahmoud, M. M. Hussein, M. Abdel-Nasser et al., “Optimal voltage control in distribution systems with intermittent PV using multi-objective grey-wolf-Lévy optimizer,” IEEE Systems Journal, vol. 14, no. 1, pp. 760-770, Mar. 2020.
[11] J. Duan, D. Shi, R. Diao et al., “Deep-reinforcement-learning-based autonomous voltage control for power grid operations,” IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 814-817, Jan. 2020.
[12] X. Sun and J. Qiu, “A customized voltage control strategy for electric vehicles in distribution networks with reinforcement learning method,” IEEE Transactions on Industrial Informatics, vol. 17, no. 10, pp. 6852-6863, Oct. 2021.
[13] P. Li, M. Wei, H. Ji et al., “Deep reinforcement learning-based adaptive voltage control of active distribution networks with multi-terminal soft open point,” International Journal of Electrical Power & Energy Systems, vol. 141, pp. 1-10, 2022.
[14] H. Liu, C. Zhang, Q. Chai et al., “Robust regional coordination of inverter-based volt/var control via multi-agent deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 12, no. 6, pp. 5420-5433, Nov. 2021.
[15] Y. Zhou, B. Zhang, C. Xu et al., “A data-driven method for fast AC optimal power flow solutions via deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1128-1139, Nov. 2020.
[16] Y. Zhou, W. Lee, R. Diao et al., “Deep reinforcement learning based real-time AC optimal power flow considering uncertainties,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 5, pp. 1098-1109, Sept. 2022.
[17] W. Wang, N. Yu, Y. Gao et al., “Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3008-3018, Jul. 2020.
[18] H. Liu and W. Wu, “Two-stage deep reinforcement learning for inverter-based volt-var control in active distribution networks,” IEEE Transactions on Smart Grid, vol. 12, no. 3, pp. 2037-2047, May 2021.
[19] D. Cao, W. Hu, X. Xu et al., “Deep reinforcement learning based approach for optimal power flow of distribution networks embedded with renewable energy and storage devices,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 5, pp. 1101-1110, Sept. 2021.
[20] D. Cao, W. Hu, J. Zhao et al., “Reinforcement learning and its applications in modern power and energy systems: a review,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029-1042, Nov. 2020.
[21] E. Liang, R. Liaw, R. Nishihara et al., “RLlib: abstractions for distributed reinforcement learning,” Proceedings of Machine Learning Research, vol. 80, pp. 3053-3062, Dec. 2017.
[22] A. Irpan. (2018, Feb.). Deep reinforcement learning doesn’t work yet. [Online]. Available: https://www.alexirpan.com/2018/02/14/rl-hard.html
[23] P. Henderson, R. Islam, P. Bachman et al., “Deep reinforcement learning that matters,” Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, vol. 32, no. 1, pp. 1-8, Mar. 2018.
[24] R. Huang, Y. Chen, T. Yin et al., “Accelerated derivative-free deep reinforcement learning for large-scale grid emergency voltage control,” IEEE Transactions on Power Systems, vol. 37, no. 1, pp. 14-25, Jan. 2022.
[25] A. Stooke and P. Abbeel. (2019, Jan.). Accelerated methods for deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1803.02811
[26] Q. Huang, R. Huang, W. Hao et al., “Adaptive power system emergency control using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1171-1182, Mar. 2020.
[27] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al. (2015, Sept.). Continuous control with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1509.02971
[28] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010.
[29] F. Zhuang, Z. Qi, K. Duan et al., “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43-76, Jan. 2021.
[30] J. O. Zhang, A. Sax, A. Zamir et al. (2022, Jan.). Side-tuning: a baseline for network adaptation via additive side networks. [Online]. Available: https://arxiv.org/abs/1912.13503