Journal of Modern Power Systems and Clean Energy

ISSN 2196-5625 CN 32-1884/TK


Data-driven Optimal Control Strategy for Virtual Synchronous Generator via Deep Reinforcement Learning Approach

  • Yushuai Li (Member, IEEE)
  • Wei Gao (Student Member, IEEE)
  • Weihang Yan (Student Member, IEEE)
  • Shuo Huang (Student Member, IEEE)
  • Rui Wang
  • Vahan Gevorgian (Senior Member, IEEE)
  • David Wenzhong Gao (Fellow, IEEE)
Department of Electrical and Computer Engineering, University of Denver, Denver, CO 80208, USA; School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110004, China; National Renewable Energy Laboratory, Golden, USA

Updated: 2021-08-02

DOI:10.35833/MPCE.2020.000267


Abstract

This paper aims at developing a data-driven optimal control strategy for the virtual synchronous generator (VSG) in the scenario where no expert knowledge or system model is available. Firstly, the optimal and adaptive control problem for the VSG is transformed into a reinforcement learning task. Specifically, the control variables, i.e., the virtual inertia and damping factor, are defined as the actions. Meanwhile, the active power output, the angular frequency, and its derivative are considered as the observations. Moreover, the reward mechanism is designed based on three preset characteristic functions that quantify the control targets: maintaining the deviation of the angular frequency within specified limits; preserving well-damped oscillations for both the angular frequency and the active power output; and obtaining a slow frequency drop in the transient process. Next, to maximize the cumulative rewards, a decentralized deep policy gradient algorithm, which is model-free and features faster convergence, is developed and employed to find the optimal control policy. With this effort, a data-driven adaptive VSG controller is obtained. Using the proposed controller, the inverter-based distributed generator can adaptively adjust its control variables based on current observations to fulfill the expected targets in a model-free fashion. Finally, simulation results validate the feasibility and effectiveness of the proposed approach.

I. Introduction

THE increasing pressure of environmental protection has made it urgent to conduct research on accommodating high penetration levels of renewable energy [1]-[4]. Renewable energy resources are converted to electricity, which is then injected into the power system via power electronic inverters [5], [6]. Unlike the conventional synchronous generator (SG) with its inherent rotating inertia, the inverter-based distributed generator (IBDG) does not provide inertia support, which may make the system sensitive to network disturbances and even jeopardize system stability [7], [8]. To remedy the shortage of system inertia, the virtual synchronous generator (VSG) has been proposed as a promising solution that controls the grid-connected inverter to emulate the dynamic behavior of SGs [9], [10]. By designing the level of virtual inertia as well as damping, the VSG can respond like an SG with a slow frequency drop, which is beneficial for the frequency stability of the power system [11], [12]. Therefore, the study of optimal control strategies for the VSG is significant for ensuring high-quality power injection and maintaining the safe operation of the power system.

It is notable that the control operation of the VSG is executed in software. As a result, the control parameters, i.e., the virtual inertia and damping factor, can be set arbitrarily without physical limits. Up to now, many control strategies for the VSG have been presented to achieve the desired dynamic performance, which can be roughly classified into two categories: rule-based approaches and optimization-based approaches. A rule-based approach determines the control behavior by using predefined operation rules. For instance, an adaptive-gain inertial control is proposed in [13], which focuses on improving the frequency nadir and guaranteeing stable operation. By evaluating the change of rotor speed as well as the change of its differential, adjustment strategies for a class of adaptive VSG parameters, i.e., the inertia and/or damping factor, are proposed in [14]-[16]. Based on a preset operation table, the control parameters can be adaptively increased or decreased between large and small values in different intervals, with the final objective of achieving small overshoot and short settling time. Based on small-signal modeling, a simple step-by-step parameter design strategy is presented in [17], which takes the double-line-frequency ripple into consideration. To achieve a tradeoff between active power and frequency regulation, a dual-adaptivity inertia control strategy is proposed in [18], which relies on a preset operation principle to obtain the range of adaptivity. Recently, [19] analyzes the transient stability of the VSG and proposes a novel mode-adaptive power-angle control that effectively enhances transient stability. With this approach, the positive-feedback mode of the power-angle control of the VSG can be adaptively switched to the negative-feedback mode after large disturbances, which avoids the loss of synchronization. Although the rule-based approaches are easy to implement, the predefined rules depend on expert knowledge, such as how to choose the large and small parameters in [14]-[16].

Recently, there is increasing interest in investigating the parameter setting of the VSG by using optimization-based approaches, where the adjustment of parameters is driven by optimal solutions. For example, the stability of a microgrid with multiple VSGs is assessed based on the voltage angle deviations in [20]. Therein, particle swarm optimization is employed to tune the control parameters of each VSG in real time to achieve a smooth transition after disturbances and limit the voltage angle deviations within a specified range. The small-signal angular stability of a power system composed of the VSG subsystem and other subsystems is investigated in [21], where a modal proximity-based approach is presented to guide the parameter design of the VSG. The concept of linear-quadratic regulator-based control is proposed to find the optimal inertia constant for a single VSG in [22], which is further extended to multiple VSGs in [23]. With this approach, the trade-off between the critical frequency limits and the control cost can be achieved. The aforementioned optimization-based approaches have made outstanding contributions to the design of control parameters for VSGs based on different requirements of power system stability. Nevertheless, these approaches are built upon small-signal modeling with a linearization procedure and a simplified mathematical model. Note that the system stability is affected not only by the VSG but also by other components, e.g., SGs, line parameters, load conditions, etc. The interaction between the VSG and its working environment (the whole system) is ignored in the existing research [13]-[18], [20]-[23]. To address this issue, one way is to establish the dynamics of the whole system, analyze the interaction between the VSG and the power system, and then design the corresponding control strategy for the VSG. However, it is very difficult to establish an exact model of the whole power system under a complex interconnected structure. Even in the special cases where such a system model can be built, it is in general high-order, nonlinear, and strongly coupled. As a result, it is also difficult for engineers to analyze the impact of the VSG on power system stability and to design the corresponding optimal control strategy under a variety of uncertain system disturbances. In addition, different systems may have different structures and components, which means that each VSG may work in a different environment. Consequently, a control strategy for the VSG based on an exact system model may not be universal. Based on the above discussion, it remains an open problem and challenge in the field of adaptive VSG control to design a universally optimal control strategy that uses only observed data without building a model of the whole system, i.e., one that works in a model-free fashion.

Thanks to the rapid evolution of artificial intelligence technology, reinforcement learning approaches make it possible to find the optimal control policy by using only the data interaction between an agent and an unknown environment, which can be considered a promising way to deal with the aforementioned challenge. Up to now, many reinforcement learning algorithms have been proposed [24]-[29]. Among them, the most popular approaches include the deep Q-network (DQN) algorithm [26] and the deep policy gradient (DPG) algorithm [28], which are obtained by successfully combining deep neural networks (DNNs) with the classical Q-learning algorithm [24] and the policy gradient algorithm [25], respectively. Depending on the application scenario, DQN and DPG are suitable for different reinforcement learning tasks. Specifically, DQN better handles the case of continuous observation spaces with discrete action spaces, while DPG fits well with both continuous observation and continuous action spaces. In this paper, we consider continuous state observations, e.g., the variations of the active power output as well as the angular frequency, to achieve continuous control operations for the VSG. Thus, the concept of DPG is more suitable for our work.

Mainly with the aforementioned inspirations, this paper investigates the optimal and adaptive control problem for the VSG in a model-free scenario, where a decentralized deep policy gradient (DDPG) algorithm is developed and employed to solve this problem. The DDPG algorithm is obtained by using the decentralized stochastic gradient descent approach [30] in place of the stochastic gradient descent approach in the classical DPG algorithm to improve the convergence speed. The major contributions of this paper are summarized as follows.

1) The optimal and adaptive control problem for the VSG is formulated and transformed into a reinforcement learning task. Therein, the expected performance for achieving multiple control targets of angular frequency and active power regulation is incorporated into the designed optimization target.

2) A data-driven optimal control policy is designed and embedded into the VSG controller based on the DDPG algorithm. It enables the IBDG to adaptively respond to system disturbances and obtain expected performance with the maximum long-term return in model-free fashion.

The remainder of this paper is organized as follows. Section II introduces VSG control, identifies its control variables as well as observation variables, and presents the unknown system dynamics. In Section III, multiple characteristic functions are defined to formulate the expected control targets. Subsequently, the optimal control problem is transformed into a reinforcement learning task, which is further solved by introducing the DDPG algorithm. Several case studies are provided to verify the effectiveness of the proposed approach in Section IV. Finally, Section V concludes this paper.

II. System Model and Problem Formulation

A simplified diagram of the power system is shown in the upper-right corner of Fig. 1, where the IBDG as well as other components are integrated into the system. There are multiple possible configurations for power systems. Meanwhile, the IBDG knows neither the system structure nor the system model. The control diagram of the IBDG is shown in the upper-left corner of Fig. 1. Therein, the concept of VSG control is embedded into the active power control loop to improve the angular frequency stability. Meanwhile, the terminal voltage of the IBDG is directly controlled through a proportional-integral (PI) controller to maintain the terminal voltage at the nominal value [31], [32]. The variables in Fig. 1 are defined in the following text.

Fig. 1 Overall structure, control, decision, and learning process.

The emulated swing equation of the VSG controller is adopted as:

$P_{\text{in}}-P_{\text{out}}=\dfrac{2\tilde{H}}{\omega_n}\dfrac{\text{d}\omega}{\text{d}t}+\tilde{D}(\omega-\omega_g)$ (1)

where $P_{\text{in}}$ is the emulated mechanical power; $P_{\text{out}}$ is the output active power after low-pass filtering; $\omega_n$ is the nominal system angular frequency; $\omega$ is the virtual angular frequency of the corresponding IBDG; $\omega_g$ is the angular frequency measured by the phase-locked loop (PLL); and $\tilde{H}$ and $\tilde{D}$ are the virtual inertia and damping factor, respectively.

According to the system frequency deviation, the governor is implemented to adjust the input power command, i.e., Pin, which adopts the ω-P droop controller as follows:

$P_{\text{in}}=P_{\text{ref}}-k(\omega-\omega_n)$ (2)

where $P_{\text{ref}}$ and $k$ are the reference active power and the droop coefficient, respectively. The choice of $k$ is determined by the standard approach [33], and it reflects the change of $P_{\text{in}}$ with respect to the angular frequency.
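For concreteness, the following Python sketch advances (1) and (2) by one forward-Euler step; the function name, the step size, and the per-unit conventions are illustrative assumptions rather than the paper's implementation.

```python
# Minimal forward-Euler sketch of the VSG loop in (1)-(2).
# Names, units, and the step size dt are illustrative assumptions.

def vsg_step(omega, omega_g, P_out, P_ref, H, D, k, omega_n, dt=1e-3):
    """Advance the virtual angular frequency omega by one time step."""
    P_in = P_ref - k * (omega - omega_n)  # governor droop, eq. (2)
    # Swing equation (1): P_in - P_out = (2H/omega_n)*domega/dt + D*(omega - omega_g)
    domega_dt = (P_in - P_out - D * (omega - omega_g)) * omega_n / (2.0 * H)
    return omega + dt * domega_dt, domega_dt
```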

Unlike the droop coefficient, the choice of the virtual inertia and damping factor is more flexible and not specially restricted. Thus, we can adaptively adjust these two controllable parameters over time to obtain the expected performance. Note that increasing or decreasing the control parameters may influence the dynamic characteristics of the active power output and angular frequency differently in different system environments.

As a grid-forming converter control, the inertial control performance of a VSG depends on both the control parameter design and the power system frequency response ωg. Hence, in order to optimally design the VSG frequency control, the frequency response model of a complex power system should be considered. On one hand, accurate modeling of the power system frequency response requires global information on governor data and generator inertia constants from multiple stations, which is difficult to obtain for local converter control design. On the other hand, the conventional power system frequency response model can no longer describe the frequency trajectory of a power system with a high penetration level of renewable energy, which suffers more from a deteriorated system frequency profile. Various energy sources, including wind turbine generators, PV generators, and battery energy storage systems, have modified the electromechanical behavior of the original power system. Therefore, considering these two aspects, a data-driven control strategy needs to be developed to optimally adjust the VSG control design in the absence of a power system model.

For each IBDG, two control parameters are considered for adjustment at time t, denoted by at:

$a_t=\{\tilde{H}_t,\tilde{D}_t\}$ (3)

To show the dynamic performance, each IBDG is equipped with a VSG controller to observe its real-time states of the output active power, angular frequency, and the derivative of angular frequency, i.e., Pout,t, ωt, and dωt/dt. The set of all observations at time t is defined as st:

$s_t=\{P_{\text{out},t},\omega_t,\text{d}\omega_t/\text{d}t\}$ (4)

Note that the adaptive parameter adjustment at is based on the control policy u(·) to be designed and the observed system states st. In this paper, a deterministic control policy u(·) is defined as the following function, which maps st to at:

$a_t=u(s_t)$ (5)

The nonlinear state-space equation of the whole system in an implicit form can be written as:

$\dot{x}_t=f(x_t,a_t,d_t)$ (6)

where $x_t$ is the vector of all the state variables, e.g., the output active power, angular frequency, output current, and voltage of each IBDG, the output frequency and active power of each SG, etc.; and $d_t$ is an uncertain disturbance or variable, such as a sudden change of the active power reference or load demand. Equation (6) provides a learning environment for the VSG controller. Note that (6) is unknown and hard to model with an explicit expression. In this paper, we do not need to know the explicit mathematical model of (6). Driven by data, the VSG controller interacts with the environment to obtain the optimal control policy u(·), which is discussed in detail in the next section.
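To make the model-free interaction concrete, the following Python sketch frames (4)-(6) as an environment loop; `PowerSystemEnv` is a hypothetical wrapper around a simulator, and only its observable inputs and outputs are used, never the internal dynamics f.

```python
# Sketch of the data-driven interaction with the unknown dynamics (6).
# PowerSystemEnv is a hypothetical stand-in for the simulated power system.

class PowerSystemEnv:
    def reset(self):
        """Return the initial observation s_1 = (P_out, omega, domega/dt)."""
        raise NotImplementedError  # provided by the simulator, not by a model of f

    def step(self, action):
        """Apply a_t = (H_t, D_t); return (s_{t+1}, r_t)."""
        raise NotImplementedError

def rollout(env, policy, T):
    """Collect one trajectory under a_t = u(s_t), eq. (5)."""
    s = env.reset()
    trajectory = []
    for _ in range(T):
        a = policy(s)
        s_next, r = env.step(a)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory
```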

III. Transformation and Solution

It is worth noting that the studied problem in this paper satisfies the Markov property [34]: given the current state and action, the next state is independent of all previous states. Deep reinforcement learning algorithms can tackle Markov decision processes without relying on a model of the probability distributions underlying the state transitions, which fits well with our work. To obtain the data-driven adaptive VSG controller, we first transform the studied optimal control problem into a reinforcement learning task. Then, the DDPG algorithm is employed to find the optimal control policy. The overall decision process of the data-driven VSG controller and the learning diagram of the DDPG algorithm are shown in Fig. 1(b) and (c), respectively.

A. Formulation of Reinforcement Learning Task

For a reinforcement learning task, three key elements need to be defined: the observation state, the action, and the reward. In this paper, the action and observation state correspond to at and st defined in (3) and (4), respectively. As shown in Fig. 1, the VSG controller interacts with the power system, i.e., the learning environment, which is referred to as the power system environment to avoid ambiguity. At each time t, the power system environment provides an observation st to the VSG controller. The VSG controller performs an action from the action space based on the policy u(·), and then observes the immediate reward rt to update the value of the state-action pair. The interactions between the data-driven VSG controller and the power system environment, through exploration and improvement during the learning process, lead the controller to an approximately optimal control policy. In this paper, we mainly focus on the regulation of the angular frequency and active power output. The design of the reward rt is based on the immediate responses of ωt, Pout,t, and dωt/dt after disturbances.

With regard to frequency regulation, poorly damped oscillations are not desired. Define $\psi_\omega=|\omega_t-\omega_n|$ as the absolute value of the angular frequency deviation and $\psi_\omega^{\max}$ as its preset upper bound. Two cases, i.e., $\psi_\omega\le\psi_\omega^{\max}$ and $\psi_\omega>\psi_\omega^{\max}$, need to be considered separately. In the case $\psi_\omega\le\psi_\omega^{\max}$, although the frequency deviation is within the allowable limits, we expect the frequency deviation to be as small as possible and the corresponding settling time to be as short as possible. To achieve this goal, we can set a small penalty on $\psi_\omega$ to assess the immediate frequency deviation; the larger $\psi_\omega$ becomes, the bigger the penalty. In the other case, $\psi_\omega>\psi_\omega^{\max}$, the system undergoes a huge security risk. Thus, to reduce the occurrence of this situation, a very big penalty should be added once $\psi_\omega>\psi_\omega^{\max}$. Based on the aforementioned discussion, the characteristic function for the deviation of the angular frequency is defined as:

$C(\omega_t)=\begin{cases}\varrho_\omega\psi_\omega & \psi_\omega\le\psi_\omega^{\max}\\ \rho_\omega & \psi_\omega>\psi_\omega^{\max}\end{cases}$ (7)

where $\varrho_\omega$ and $\rho_\omega$ are the small and big penalty coefficients, respectively.

Note that one major functionality of VSG control is to obtain slow electromechanical dynamics like those of the SG. In other words, a better transient process should contribute to a reduced rate of change of frequency (ROCOF). To this end, the characteristic function for the change rate of the angular frequency is defined as:

$C(\text{d}\omega_t/\text{d}t)=\varrho_{\text{d}\omega}|\text{d}\omega_t/\text{d}t|$ (8)

where $\varrho_{\text{d}\omega}$ is a small penalty coefficient.

For the characteristic of active power output, it is also expected to obtain well-damped oscillation. Similar to the functionality of the first part of (7), the characteristic function for the deviation of active power output is defined as:

$C(P_{\text{out},t})=\varrho_P\psi_P$ (9)

where $\varrho_P$ is the corresponding penalty coefficient; and $\psi_P=|P_{\text{out},t}-P_{\text{ref}}|$ is the absolute value of the deviation of the active power output. Since Pref may change greatly due to intermittent renewable energy resources, e.g., wind and solar, it is not important to limit the upper bound of ψP during the transient process. Moreover, the capacity of the inverter is selected so that headroom is available for the necessary inertial support.

According to the expected performance and the characteristic function defined above, the reward at time t is denoted by rt as:

$r_t=-b_\omega C(\omega_t)-b_{\text{d}\omega}C(\text{d}\omega_t/\text{d}t)-b_P C(P_{\text{out},t})$ (10)

where bω>0, bdω>0, and bP>0 are the weight coefficients. By choosing different weight coefficients, different output characteristics can be obtained.
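A compact Python reading of the reward design (7)-(10) is sketched below; the coefficient values follow Table I, while the function signature itself is an assumption.

```python
import numpy as np

def reward(omega, domega_dt, P_out, P_ref, omega_n,
           psi_w_max=2 * np.pi * 0.8,       # preset bound, Table I
           rho_small=10.0, rho_big=1000.0,  # frequency penalty coefficients, Table I
           rho_dw=2.0, rho_P=2.0,
           b_w=1/3, b_dw=1/3, b_P=1/3):
    psi_w = abs(omega - omega_n)
    C_w = rho_small * psi_w if psi_w <= psi_w_max else rho_big  # eq. (7)
    C_dw = rho_dw * abs(domega_dt)                              # eq. (8)
    C_P = rho_P * abs(P_out - P_ref)                            # eq. (9)
    return -(b_w * C_w + b_dw * C_dw + b_P * C_P)               # eq. (10)
```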

Note that the dynamic performance of the active power and angular frequency regulation is measured by a relatively long-term reward. For example, consider a case where a sudden load change at t0 results in a large frequency oscillation. The period from the beginning to the end of the frequency oscillation corresponds to a time interval, and whether the dynamic performance improves depends on the cumulative penalties over this long-time response rather than at one moment only. To this end, the return from state st is further defined as the cumulative future reward Rt:

$R_t=\displaystyle\sum_{k=t}^{T}\gamma^{k-t}r_k$ (11)

where T is the total time; and γ is the discount factor.
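The return (11) can be accumulated backward in time, as the short sketch below illustrates.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t of eq. (11) from the reward sequence r_t, ..., r_T."""
    R = 0.0
    for r in reversed(rewards):  # backward recursion: R_k = r_k + gamma * R_{k+1}
        R = r + gamma * R
    return R
```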

Then, after making an observation st and executing an action at, the action value function under the control policy u(·) is the expected return, defined as Qu:

$Q^{u}(s_t,a_t)=\mathbb{E}\left[R_t\,|\,s_t,a_t,u(\cdot)\right]$ (12)

where $\mathbb{E}$ denotes the expected value of Rt. Our objective becomes finding the optimal control policy u*(·) that maximizes the expected return from the start of the disturbance.

B. DDPG Algorithm

As stated in Section II, both the system observation state and the action are continuous. To account for this attribute, the concept of the DPG algorithm based on the actor-critic architecture is adopted and further extended in this paper. More importantly, we adopt the decentralized stochastic gradient descent approach in place of the stochastic gradient descent approach in the learning process of the traditional DPG algorithm, which we refer to as the DDPG algorithm. With the DDPG algorithm, the global computation can be divided among individual computation units, resulting in a faster convergence process. It is assumed that there are κ computation units. The information sharing among the computation units is described by a graph $G=(\mathcal{V},\mathcal{E},W)$, where $\mathcal{V}=\{j=1,2,\ldots,\kappa\}$ is the set of nodes representing the computation units; $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ represents the available communication links; $W=\{w_{j\tilde{j}}\}\in\mathbb{R}^{\kappa\times\kappa}$ is the associated adjacency matrix; and $\tilde{j}$ denotes a neighbor node of $j$. It is assumed that graph G is undirected and connected. To achieve experience replay, the experience $e_t=(s_t,a_t,r_t,s_{t+1})$ at each time step t is stored in a data set D, which is accessible to every computation unit.
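A minimal sketch of the shared replay memory D and of one possible doubly stochastic adjacency matrix W for a ring of computation units follows; the uniform 1/3 mixing weights are an assumption, since the paper does not report W.

```python
import random
from collections import deque
import numpy as np

replay_buffer = deque(maxlen=100_000)  # stores e_t = (s_t, a_t, r_t, s_{t+1})

def sample_minibatch(batch_size):
    """Uniformly sample experiences from the shared data set D."""
    return random.sample(replay_buffer, batch_size)

def ring_adjacency(kappa=4):
    """Equal-weight adjacency matrix for a ring graph (kappa = 4 in Section IV)."""
    W = np.zeros((kappa, kappa))
    for j in range(kappa):
        for n in (j, (j - 1) % kappa, (j + 1) % kappa):  # self plus two neighbors
            W[j, n] = 1.0 / 3.0
    return W  # each row and column sums to 1
```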

The overall block diagram showing the realization of the policy updating based on the DDPG algorithm is presented in Fig. 1. The actor function is employed to estimate the policy, which deterministically maps the observation state of the current power system environment to a specific action. The critic function is employed to estimate the action value function, where the output of the actor is fed in as one of the inputs of the critic. Two neural networks, referred to as the actor network and the critic network, are used to approximate the actor and critic functions with parameters θu and θQ, respectively. In this scenario, the control policy u(st) parameterized by θu in the actor network is rewritten as u(st|θu). Meanwhile, the action value function Qu(st,at) parameterized by θQ in the critic network is represented by Qu(st,at|θQ). Additionally, similar to [26], separate target networks are used to stabilize the reinforcement learning algorithm. The parameters of the target networks, denoted as θu' and θQ', slowly track those of the actor and critic networks. It has been widely verified that learning without target networks does not perform well in many reinforcement learning tasks. For a reinforcement learning task, exploration in continuous action spaces is important and necessary. In this paper, we employ an exploration policy that adds a random Gaussian disturbance/noise $\delta_t\sim\mathcal{N}(0,\sigma_t^2 I)$ to the actor policy [35], where $\sigma_t^2$ is the variance, so that at=u(st|θu)+δt. Note that the random noise is persistently exciting. To obtain effective learning, we set a large noise during the early learning stages, since no reliable knowledge has yet been learned by the VSG agent and more exploration is needed. Later, the magnitude of the noise is gradually reduced so that the VSG agent can effectively use the accumulated experience to select actions and obtain larger cumulative rewards. To capture this concept, an exponential decay is employed for σt:

$\sigma_t=\exp(-\lambda t)$ (13)

where $\lambda$ is the decay rate.
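The exploration mechanism can be sketched as follows; the decay-rate value of 0.001 follows Table I, and the two-dimensional action matches at in (3).

```python
import numpy as np

def exploration_noise(t, action_dim=2, decay_rate=0.001, rng=None):
    """Draw delta_t ~ N(0, sigma_t^2 I) with sigma_t = exp(-lambda*t), eq. (13)."""
    rng = rng or np.random.default_rng()
    sigma_t = np.exp(-decay_rate * t)
    return rng.normal(0.0, sigma_t, size=action_dim)

# Exploratory action: a_t = u(s_t | theta_u) + delta_t
```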

Define $\theta_j^Q$ and $\theta_j^u$ as the estimated critic and actor network parameters of the jth computation unit, and $\theta_j^{Q'}$ and $\theta_j^{u'}$ as the corresponding parameters of the target networks. The loss functions used to update the critic and actor network parameters are given by:

$L(\theta^Q)=\dfrac{1}{\kappa}\displaystyle\sum_{j=1}^{\kappa}L_j(\theta_j^Q)$ (14)

$L_j(\theta_j^Q)=\mathbb{E}_{(s_i,a_i,r_i,s_{i+1})\sim D}\Big[\big(r_i+\gamma Q^{u'}(s_{i+1},u'(s_{i+1}|\theta_j^{u'})\,|\,\theta_j^{Q'})-Q^{u}(s_i,a_i|\theta_j^Q)\big)^2\Big]$ (15)

In this paper, multiple computation units cooperate to train θQ. At each step, to minimize (14), every computation unit samples a random mini-batch of experiences (si,ai,ri,si+1) from the memory pool D to compute the local stochastic gradient, denoted by $\nabla_{\theta_j^Q}L_j(\theta_j^Q)$. The parameter $\theta_j^u$ is updated by applying the chain rule to maximize the expected return. Specifically, the sample-based approximation of the action gradient is given by:

$\nabla_{\theta_j^u}J(\theta_j^u)\approx\mathbb{E}_{s_i}\Big[\nabla_a Q^{u}(s_i,a|\theta_j^Q)\big|_{a=u(s_i|\theta_j^u)}\nabla_{\theta_j^u}u(s_i|\theta_j^u)\Big]$ (16)

where J is the approximate value function. The parameters $\theta_j^Q$ and $\theta_j^u$ are further updated via local computation based on each unit's own information and that of its neighbors:

$\theta_j^Q\leftarrow\displaystyle\sum_{\tilde{j}=1}^{\kappa}w_{j\tilde{j}}\theta_{\tilde{j}}^Q-\zeta^Q\nabla_{\theta_j^Q}L_j(\theta_j^Q)$ (17)

$\theta_j^u\leftarrow\displaystyle\sum_{\tilde{j}=1}^{\kappa}w_{j\tilde{j}}\theta_{\tilde{j}}^u-\zeta^u\nabla_{\theta_j^u}J(\theta_j^u)$ (18)

where $\zeta^Q$ and $\zeta^u$ are the learning rates. Finally, we obtain θQ and θu by averaging $\theta_j^Q$ and $\theta_j^u$ over all $j\in\mathcal{V}$.
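For illustration, one decentralized update of (17) and (18) over flattened parameter vectors may look like the following sketch, where the gradient callables stand in for the gradients of (15) and (16).

```python
import numpy as np

def decentralized_step(theta_Q, theta_u, W, grad_L, grad_J, zeta_Q, zeta_u):
    """One consensus-plus-gradient update per (17)-(18).

    theta_Q, theta_u: (kappa, dim) arrays holding each unit's parameters as a row.
    grad_L(j, theta), grad_J(j, theta): local stochastic gradients from (15), (16).
    """
    mixed_Q = W @ theta_Q  # consensus term: sum over neighbors of w_jj' * theta_j'
    mixed_u = W @ theta_u
    for j in range(theta_Q.shape[0]):
        theta_Q[j] = mixed_Q[j] - zeta_Q * grad_L(j, theta_Q[j])  # eq. (17)
        theta_u[j] = mixed_u[j] - zeta_u * grad_J(j, theta_u[j])  # eq. (18)
    return theta_Q, theta_u
```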

Based on the current action, the VSG controller will change its control parameters. Then, new transition (st,at,rt,st+1) will be generated, which is used to update the parameters θQ and θu. Correspondingly, the control policy u(st|θu) is updated. After that, the one-step learning process is finished. The detailed learning process based on DDPG algorithm to find the optimal control strategy is presented in Algorithm 1. Note that the DDPG algorithm is employed to train the data-driven VSG controller offline. After that, the well-trained controller can be used in online applications.

Algorithm 1  : DDPG algorithm

Input: Adjacency matrix W; learning rates ζQ, ζu; mini-batch size C; number of episodes M; probability ε; smoothing factor τ

Output: Optimal control policy u(st|θu)

Initialize: Randomly initialize the weights θjQ and θju of the critic network and actor network for each $j\in\mathcal{V}$; initialize the target network weights θjQ'←θjQ and θju'←θju for each $j\in\mathcal{V}$; initialize the replay buffer D

1  for episode=1,2,,M do

2  Initialize a random disturbance for control behavior exploration

3  Receive initial observation state s1

4   for t=1,2,,T do

5  Select action at=u(st|θu)+δt based on current policy and exploration noise δt

6  Calculate reward using (10)

7  Observe the new state st+1

8  Store transition (st,at,rt,st+1) into D

9  for j=1,2,,κ do

10 Randomly sample a mini-batch of C transitions (si,ai,ri,si+1) from D

11 Calculate the stochastic gradient $\nabla_{\theta_j^Q}L_j(\theta_j^Q)$ by minimizing the loss function (15)

12 Calculate sampled policy gradient according to (16)

13 Update the estimated critic network parameter θjQ by the jth computation unit according to (17)

14 Update the estimated actor network parameter θju by the jth computation unit according to (18)

15 Update target networks using (19) and (20)

16 end for

17 Update critic network parameter using (21)

18 Update actor network parameter using (22)

19 end for

20 end for

$\theta_j^{Q'}\leftarrow\tau\theta_j^Q+(1-\tau)\theta_j^{Q'}$ (19)

$\theta_j^{u'}\leftarrow\tau\theta_j^u+(1-\tau)\theta_j^{u'}$ (20)

$\theta^Q=\dfrac{1}{\kappa}\displaystyle\sum_{j=1}^{\kappa}\theta_j^Q$ (21)

$\theta^u=\dfrac{1}{\kappa}\displaystyle\sum_{j=1}^{\kappa}\theta_j^u$ (22)

Remark: Compared with the DPG algorithm, the DDPG algorithm embeds the decentralized stochastic gradient descent approach. With this effort, the DDPG algorithm can simultaneously employ multiple computation units to train the neural network parameters as shown in (19)-(22), resulting in faster convergence than the traditional DPG algorithm. In this paper, the reinforcement learning task is designed for the VSG controller of an individual IBDG, which means that all parallel computation units cooperate to train one VSG controller, as shown in Fig. 1. To reduce the training time, this paper employs the DDPG algorithm.
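The target-network tracking (19)-(20) and the final averaging (21)-(22) reduce to a few lines; the sketch below assumes the per-unit parameters are stored as rows of a NumPy array.

```python
def soft_update(theta_target, theta, tau=0.001):
    """Slowly track the learned network, eqs. (19)-(20)."""
    return tau * theta + (1.0 - tau) * theta_target

def average_parameters(theta_per_unit):
    """Consensus estimate over all computation units, eqs. (21)-(22).

    theta_per_unit: (kappa, dim) NumPy array of per-unit parameters.
    """
    return theta_per_unit.mean(axis=0)
```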

IV. Simulation Results

In this section, we focus on verifying the effectiveness and feasibility of the DDPG algorithm with simulations on a modified IEEE 14-bus test system [32]. The topology of the modified test system is shown in Fig. 2. It is composed of two synchronous generators, one 2.5 MW IBDG installed at bus 14, twelve loads, and one load disturbance. Therein, the load disturbance is located at bus 4 and marked with a green arrow for distinction. One of the motivations for integrating the VSG into the power system is to mitigate the deteriorated system frequency regulation resulting from the high penetration level of renewable energy. Hence, the system frequency transients after load disturbances are used to train the VSG controller. To simulate the disturbances, we let the load disturbance change randomly within the interval [0.2, 1.4] MW. Meanwhile, the active power reference is randomly chosen within the interval [0.5, 1.8] MW. We consider four computation units interconnected with each other to form a ring communication network. To maintain a sufficient rotor-angle stability margin for this grid-forming control approach, the line impedance should be considered in the design of the VSG system. As a result, the imitated rotor angle of the VSG should be sufficiently small at rated power, such that the proposed VSG can ride through certain system faults during operation [19]. The simulations are conducted in MATLAB/Simulink. In the following, the first case study focuses on training the actor and critic neural networks to obtain the optimal control policy. The performance of the well-trained VSG controller is then tested after a load disturbance and an active power change in the second and third case studies, respectively.
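The randomized training conditions described above might be sampled as follows; the uniform sampling scheme is an assumption consistent with the stated intervals.

```python
import numpy as np

rng = np.random.default_rng()

def sample_episode_conditions():
    """Draw one randomized operating point per training episode (Section IV)."""
    load_disturbance_MW = rng.uniform(0.2, 1.4)  # load disturbance at bus 4
    P_ref_MW = rng.uniform(0.5, 1.8)             # active power reference
    return load_disturbance_MW, P_ref_MW
```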

Fig. 2 Modified IEEE 14-bus test system.

A. Training Neural Networks and Comparison

In this case study, the adopted structures of the actor and critic neural networks are shown in Fig. 3. The critic network consists of a state path, an action path, and a common path. The observations and the actions are the inputs of the state path and the action path, respectively. The outputs of the state path and the action path are combined into one layer, which forms the input of the common path. The output of the common path is the estimated action value function. For the actor network, the inputs and outputs are the observations and actions, respectively. The terms ReLU and tanh in Fig. 3 are standard activation functions widely used in the design of deep neural networks. Specifically, ReLU and tanh are the rectified linear unit function and the hyperbolic tangent function, respectively, whose explicit formulations are given by:

$\text{ReLU}(x)=\max(0,x)$, $\tanh(x)=\dfrac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$ (23)

Fig. 3 Structures of critic and actor networks.

Moreover, each fully connected layer multiplies its input by a weight matrix and then adds a bias vector, and the scaling layer is used to scale the input variables. The remaining simulation parameters are listed in Table I. The DDPG algorithm is trained over M=350 episodes, taking 31 h 23 min 32 s. The cumulative reward of each episode, referred to as the episode reward, is shown in Fig. 4(a). It can be observed that there is no obvious improvement of the episode reward from episode 250 to episode 350, which implies that the DDPG has become stable. Thus, the training can be stopped after 350 episodes. Meanwhile, the parameters of the actor and critic neural networks are saved, and the optimal control policy is obtained. Note that, during the learning process, no expert experience or whole-system model is required; the optimal control policy is obtained through numerous explorations and improvements driven by observation data only. Finally, the optimal control policy is embedded into the VSG controller, resulting in the well-trained VSG controller.
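An illustrative PyTorch reading of the Fig. 3 architecture is sketched below; the hidden-layer sizes and the scalar action scaling are assumptions, as the paper reports only the path structure and the activation functions.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """State path and action path merged into a common path, as in Fig. 3."""
    def __init__(self, obs_dim=3, act_dim=2, hidden=64):
        super().__init__()
        self.state_path = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.action_path = nn.Sequential(nn.Linear(act_dim, hidden), nn.ReLU())
        self.common_path = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, a):
        z = torch.cat([self.state_path(s), self.action_path(a)], dim=-1)
        return self.common_path(z)  # estimated Q(s, a | theta_Q)

class Actor(nn.Module):
    """Observations to actions through tanh and a scaling layer, as in Fig. 3."""
    def __init__(self, obs_dim=3, act_dim=2, hidden=64, action_scale=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh())
        self.action_scale = action_scale  # scaling layer

    def forward(self, s):
        return self.action_scale * self.net(s)  # a = u(s | theta_u)
```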

Table I Training Parameters

| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| $\varrho_\omega$ | 10 | $\gamma$ | 0.9 |
| $\varrho_{\text{d}\omega}$ | 2 | $\zeta^Q$ | 0.001 |
| $\varrho_P$ | 2 | $\zeta^u$ | 0.0005 |
| $\rho_\omega$ | 1000 | $\tau$ | 0.001 |
| $b_\omega$ | 1/3 | $\psi_\omega^{\max}$ | 2π×0.8 Hz |
| $b_{\text{d}\omega}$ | 1/3 | $\lambda$ | 0.001 |
| $b_P$ | 1/3 |  |  |

Fig. 4 Cumulative reward for each episode. (a) DDPG algorithm. (b) DPG algorithm.

Next, the traditional DPG algorithm is employed to solve the same problem; it can be seen as a special case of the DDPG algorithm with one computation unit, i.e., κ=1. Accordingly, the decentralized stochastic gradient descent approach reduces to the stochastic gradient descent approach during backpropagation. With the same neural network structures and parameters, the episode reward obtained by the DPG algorithm is shown in Fig. 4(b). The total training time is 68 h 17 min 25 s, which is longer than that of the DDPG algorithm. In addition, it can be observed from Fig. 4(a) and (b) that the DDPG algorithm requires fewer episodes than the DPG algorithm to achieve a similar episode reward. These results exhibit the faster convergence of the DDPG algorithm, which stems from its ability to use multiple computation units simultaneously to accelerate the training process.

B. Load Disturbance

In this case study, we aim to verify the effectiveness of the well-trained VSG controller under a load disturbance. At t=20 s, a 0.7 MW load disturbance is added to the test system. The simulation results are shown in Figs. 5 and 6. It can be observed that the IBDG responds to the load disturbance adaptively and automatically. Specifically, the maximum angular frequency deviation is 2π×0.42 Hz, which is within the preset upper bound of ψωmax=2π×0.8 Hz. Meanwhile, the frequency changes relatively slowly, i.e., with a small ROCOF. As a result, the IBDG exhibits a slow frequency drop, which meets the major functionality of VSG control. Moreover, the oscillation of the active power output is also well damped. Note that by implementing the well-trained VSG controller, the tradeoff between the frequency response and the active power output can be achieved and maintained as desired, which fulfills the expected performance discussed in Section III-A. This is because the design of the immediate reward penalizes bad performance; then, driven by the stimulation of the long-term return, satisfactory results are obtained. In addition, secondary frequency control is not included, and the frequency deviation at the system steady state relates to the predefined droop parameters of the individual generation units. Based on the above discussion, it can be concluded that the well-trained VSG controller possesses good adaptability and performs well after a load disturbance.

Fig. 5 Frequency response after load disturbance.

Fig. 6 Active power output of IBDG after load disturbance.

C. Change of Power Reference

In this case study, the focus is on testing the effectiveness of the well-trained VSG controller after a change of the active power reference. At t=20 s, there is a step change in the active power reference from 0.7 p.u. to 0.5 p.u. The simulation results for the frequency response and active power output of the IBDG are shown in Figs. 7 and 8, respectively.

Fig. 7 Frequency response after change of power reference.

Fig. 8 Active power output of IBDG after change of power reference.

As observed, both the frequency and active power output gradually converge to a new stable equilibrium with well-damped oscillations, and the system ROCOF is mitigated. Thus, the expected performance targets are fulfilled. This implies that the well-trained VSG controller exhibits better adaptability and works well after the change of power reference.

D. Performance Test in a New Test System

In this case study, the performance of the well-trained VSG controller obtained from the first case study is further tested in a new IEEE 14-bus test system, which is different from that used in offline training. Specifically, the SG at bus 2 is replaced with an IBDG and the IBDG at bus 14 is disconnected. Referring to the structure of IEEE 14-bus test system, three synchronous condensers are commissioned at bus 3, bus 8, and bus 6, respectively. By replacing the system SG with IBDG and integrating synchronous condensers, the equivalent inertial constant and frequency response model of the system are inevitably changed. At time t=20 s, a 0.4 MW load disturbance at bus 4 is added in the test system. The comparative system frequency responses and active power outputs with different converter controls after load disturbance are shown in Figs. 9 and 10, respectively.

Fig. 9 Frequency responses with different converter controls after load disturbance.

Fig. 10 Active power responses with different converter controls after load disturbance.

Typically, a grid-following converter control approach does not participate in power system frequency regulation, as it simply follows the system frequency through the PLL. Both droop converter control and the VSG are able to participate in power system frequency regulation and enhance the system small-signal stability due to their grid-forming nature. Furthermore, the proposed data-driven VSG control better arrests the ROCOF of the power system and provides the necessary inertial control. Meanwhile, the oscillation of the active power output is also well damped. Note that it is impossible for the data-driven VSG controller to be trained in all transient scenarios.

Next, we further test the performance of the proposed VSG controller after a fault transient. The system dispatching scenario is the same as that presented in Figs. 9 and 10. At t=20 s of the simulation, a fault is introduced on the transmission line that connects bus 2 and bus 3. The fault lasts for 10 cycles and trips the transmission line. The simulation results with different converter controls are shown in Fig. 11. Note that the well-trained VSG controller was trained neither in the fault transient scenario nor in the new test system. Thus, the optimality of the convergence results cannot be guaranteed. However, taking advantage of the virtual inertia introduced by the VSG, the power system stability can be enhanced: the frequency deviation of the power system integrated with the VSG is less than in the other two cases in Fig. 11, and synchronism [19] is better preserved.

Fig. 11 Frequency responses with different converter controls after fault transient.

The simulation results show that the VSG controller also works well in the new test system. However, better performance cannot always be ensured in arbitrary new systems, since the controller was not trained in the new environment. In practical applications, the VSG controller requires re-training if used in a different system.

V. Conclusion

This paper investigates the adaptive and optimal control problem for VSG. To achieve the expected control performance target for frequency regulation and active power regulation, multiple characteristic functions are defined and further used to form the immediate reward. With this effort, the optimal control problem is finally formulated as a reinforcement learning task. To handle this task, the DDPG algorithm is employed to learn the optimal control policy with the objective of maximum long-term return. The implementation of the DDPG algorithm does not need any expert knowledge and does not rely on the system model. Thus, we can obtain the optimal control policy in a model-free fashion, which is the major advantage compared with the existing optimal control approaches used in VSG. In the future, the voltage stability and further application of the DDPG algorithm will be considered.

REFERENCES

[1] H. Zhang, Y. Li, D. W. Gao et al., "Distributed optimal energy management for energy internet," IEEE Transactions on Industrial Informatics, vol. 13, no. 6, pp. 3081-3097, Dec. 2017.

[2] J. Zhou, Y. Xu, and H. Sun, "Distributed power management for networked AC/DC microgrids with unbalanced microgrids," IEEE Transactions on Industrial Informatics, vol. 16, no. 3, pp. 1655-1667, Mar. 2020.

[3] Y. Li, H. Zhang, X. Liang et al., "Event-triggered based distributed cooperative energy management for multienergy systems," IEEE Transactions on Industrial Informatics, vol. 15, no. 14, pp. 2008-2022, Apr. 2019.

[4] Y. Li, D. W. Gao, W. Gao et al., "Double-mode energy management for multi-energy system via distributed dynamic event-triggered Newton-Raphson algorithm," IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5339-5356, Nov. 2020.

[5] R. Wang, Q. Sun, D. Ma et al., "The small-signal stability analysis of the droop-controlled converter in electromagnetic timescale," IEEE Transactions on Sustainable Energy, vol. 10, no. 3, pp. 1459-1469, Jul. 2019.

[6] Z. Yi, Y. Xu, W. Gu et al., "A multi-time-scale economic scheduling strategy for virtual power plant based on deferrable loads aggregation and disaggregation," IEEE Transactions on Sustainable Energy, vol. 11, no. 3, pp. 1332-1346, Jul. 2020.

[7] J. Zhou, Y. Xu, H. Sun et al., "Distributed event-triggered consensus based current sharing control of DC microgrids considering uncertainties," IEEE Transactions on Industrial Informatics, vol. 16, no. 12, pp. 7413-7425, Dec. 2020.

[8] Y. Li, D. W. Gao, W. Gao et al., "A distributed double-Newton descent algorithm for cooperative energy management of multiple energy bodies in energy internet," IEEE Transactions on Industrial Informatics, doi: 10.1109/TII.2020.3029974.

[9] Q. Zhong and G. Weiss, "Synchronverters: inverters that mimic synchronous generators," IEEE Transactions on Industrial Electronics, vol. 58, no. 4, pp. 1259-1267, Apr. 2011.

[10] Q. Zhong, "Virtual synchronous machines: a unified interface for grid integration," IEEE Power Electronics Magazine, vol. 3, no. 4, pp. 18-27, Dec. 2016.

[11] J. Chen and T. O'Donnell, "Parameter constraints for virtual synchronous generator considering stability," IEEE Transactions on Power Systems, vol. 34, no. 3, pp. 2479-2481, May 2019.

[12] Z. Yi, Y. Xu, J. Zhou et al., "Bi-level programming for optimal operation of an active distribution network with multiple virtual power plants," IEEE Transactions on Sustainable Energy, vol. 11, no. 4, pp. 2855-2869, Oct. 2020.

[13] J. Lee, G. Jang, E. Muljadi et al., "Stable short-term frequency support using adaptive gains for a DFIG-based wind power plant," IEEE Transactions on Energy Conversion, vol. 31, no. 3, pp. 6289-6297, Sep. 2016.

[14] D. Li, Q. Zhu, S. Lin et al., "A self-adaptive inertia and damping combination control of VSG to support frequency stability," IEEE Transactions on Energy Conversion, vol. 32, no. 1, pp. 397-398, Mar. 2017.

[15] F. Wang, L. Zhang, X. Feng et al., "An adaptive control strategy for virtual synchronous generator," IEEE Transactions on Industry Applications, vol. 54, no. 5, pp. 5124-5133, Sep. 2018.

[16] J. Li, B. Wen, and H. Wang, "Adaptive virtual inertia control strategy of VSG for micro-grid based on improved bang-bang control strategy," IEEE Access, vol. 7, pp. 39509-39514, Mar. 2019.

[17] H. Wu, X. Ruan, D. Yang et al., "Small-signal modeling and parameters design for virtual synchronous generators," IEEE Transactions on Industrial Electronics, vol. 63, no. 7, pp. 4292-4303, Jul. 2016.

[18] M. Li, W. Huang, N. Tai et al., "A dual-adaptivity inertia control strategy for virtual synchronous generator," IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 594-604, Jan. 2020.

[19] H. Wu and X. Wang, "A mode-adaptive power-angle control method for transient stability enhancement of virtual synchronous generators," IEEE Journal of Emerging and Selected Topics in Power Electronics, vol. 8, no. 2, pp. 1034-1049, Jun. 2020.

[20] J. Alipoor, Y. Miura, and T. Ise, "Stability assessment and optimization methods for microgrid with multiple VSG units," IEEE Transactions on Smart Grid, vol. 9, no. 2, pp. 1463-1471, Mar. 2018.

[21] W. Du, Q. Fu, and H. Wang, "Power system small-signal angular stability affected by virtual synchronous generators," IEEE Transactions on Power Systems, vol. 34, no. 4, pp. 3209-3219, Jul. 2019.

[22] U. Markovic, Z. Chu, P. Aristidou et al., "Fast frequency control scheme through adaptive virtual inertia emulation," in Proceedings of 2018 IEEE Innovative Smart Grid Technologies - Asia, Singapore, Mar. 2018, pp. 787-792.

[23] U. Markovic, Z. Chu, P. Aristidou et al., "LQR-based adaptive virtual synchronous machine for power systems with high inverter penetration," IEEE Transactions on Sustainable Energy, vol. 10, no. 3, pp. 1501-1511, Jul. 2019.

[24] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, May 1992.

[25] D. Silver, G. Lever, N. Heess et al., "Deterministic policy gradient algorithms," in Proceedings of the 31st International Conference on Machine Learning, Beijing, China, Jun. 2014, pp. 387-395.

[26] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.

[27] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al. (2019, Jul.). Continuous control with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1509.02971v2

[28] J. Schulman, P. Moritz, S. Levine et al. (2018, Oct.). High-dimensional continuous control using generalized advantage estimation. [Online]. Available: https://arxiv.org/abs/1506.02438

[29] Y. Li. (2018, Nov.). Deep reinforcement learning: an overview. [Online]. Available: https://arxiv.org/abs/1701.07274

[30] X. Lian, W. Zhang, C. Zhang et al. (2018, Sep.). Asynchronous decentralized parallel stochastic gradient descent. [Online]. Available: https://arxiv.org/abs/1710.06952

[31] W. Du, Z. Chen, K. P. Schneider et al., "A comparative study of two widely used grid-forming droop controls on microgrid small signal stability," IEEE Journal of Emerging and Selected Topics in Power Electronics, vol. 8, no. 2, pp. 963-975, Jun. 2020.

[32] M. I. Jordan and T. M. Mitchell, "Machine learning: trends, perspectives, and prospects," Science, vol. 349, no. 6245, pp. 255-260, Jul. 2015.

[33] R. Wang, Q. Sun, P. Zhang et al., "Reduced-order transfer function model of the droop-controlled inverter via Jordan continued-fraction expansion," IEEE Transactions on Energy Conversion, vol. 35, no. 3, pp. 1585-1595, Sep. 2020.

[34] W. Yan, L. Cheng, S. Yan et al., "Enabling and evaluation of inertial control for PMSG-WTG using synchronverter with multiple virtual rotating masses in microgrid," IEEE Transactions on Sustainable Energy, vol. 11, no. 2, pp. 1078-1088, Apr. 2020.

[35] P. Wawrzynski, "Control policy with autocorrelated noise in reinforcement learning for robotics," International Journal of Machine Learning and Computing, vol. 5, no. 2, pp. 91-95, Apr. 2015.