Abstract
High penetration of distributed renewable energy sources and electric vehicles (EVs) makes the future active distribution network (ADN) highly variable. These characteristics pose great challenges to traditional voltage control methods. Voltage control based on the deep Q-network (DQN) algorithm offers a potential solution to this problem because it possesses human-level control performance. However, the traditional DQN methods may overestimate action reward values, resulting in degradation of the obtained solutions. In this paper, an intelligent voltage control method based on the averaged weighted double deep Q-network (AWDDQN) algorithm is proposed to overcome the overestimation of action reward values in the DQN algorithm and the underestimation of action reward values in the double deep Q-network (DDQN) algorithm. Using the proposed method, the voltage control objective is incorporated into the designed action reward values and normalized to form a Markov decision process (MDP) model, which is solved by the AWDDQN algorithm. The designed AWDDQN-based intelligent voltage control agent is trained offline and then used as an online intelligent dynamic voltage regulator for the ADN. The proposed voltage control method is validated using the IEEE 33-bus and 123-bus systems containing renewable energy sources and EVs, and compared with methods based on the DQN and DDQN algorithms as well as traditional mixed-integer nonlinear programming (MINLP) based methods. The simulation results show that the proposed method has better convergence and less voltage volatility than the other methods.
With the large-scale integration of distributed renewable energy sources (RESs) such as photovoltaic (PV) generators and wind turbines (WTs) and the massive adoption of electric vehicles (EVs), the mismatch between energy production and energy consumption in active distribution networks (ADNs) becomes highly variable. The resulting dynamic changes in power flows and voltages increase the risk of violating the voltage limits at the nodes of the ADN where RESs and EVs are connected [
Existing voltage control methods for ADNs with EVs usually regulate EV charging and discharging either independently or in coordination with other control resources. For example, EV charging power has been controlled for voltage regulation in ADNs based on model prediction of EV behaviors and convex optimization methods [
However, the randomness of the EV state of charge (SOC) and the temporal and spatial uncertainties of when and where EVs connect to the grid make the schedulable capacity of EVs stochastic and time-varying. This turns the optimal voltage control model using EVs into a stochastic, mixed-integer, nonlinear programming problem, whose solution is time-consuming or even infeasible [
To solve the above problem, distributed and hierarchical voltage control methods are proposed in the literature. In particular, hierarchical control of EVs by using EV aggregators (EVAs) is proposed in [
The model-based voltage control methods mentioned above, however, are highly dependent on prediction results, specific equipment, and optimization models that are difficult to solve accurately. Thus, this class of methods faces difficulties in coping with scenarios of time-varying renewable energy generations, loads, and adjustable capacity of EVs [
To address the problems of model-based control methods, model-free control methods using reinforcement learning (RL) algorithms have received increasing attention in power system applications, because they do not require a tedious and complicated modeling process, reduce model inaccuracies caused by experience and bias, do not rely on prediction, and have better portability and adaptability. One of the most representative and widely used RL algorithms is Q-learning. For example, voltage control methods using the Q-learning algorithm have been applied to transformer tap and capacitor switching control in ADNs [
However, traditional Q-learning algorithms based on state-action tables are prone to the “curse of dimensionality” in scenarios with many states, so they cannot efficiently cope with voltage control in an ADN with many nodes and variable states. Deep reinforcement learning (DRL) with human-level control, which combines reinforcement learning with deep learning, such as the deep Q-network (DQN) algorithm, can handle problems with many states by estimating the action reward values of different states [
Overestimation or underestimation of actions in DRL can seriously degrade the learning performance. In the DQN algorithm, the same action reward value is used for the selection and evaluation of an action, thus, the problem of overestimation of the action reward value is common [
In summary, although the DQN and DDQN algorithms have been used to solve the problem of voltage control in ADN, few of the existing studies discuss the problem of control performance degradation caused by the action estimation bias of DQN and DDQN algorithms. In contrast, AWDDQN algorithm can overcome the shortcomings of DQN and DDQN algorithms. However, there are few applications of DDQN and AWDDQN algorithms in the field of voltage control in ADNs [
In this paper, an intelligent voltage control method based on AWDDQN algorithm is proposed for ADNs. The main contributions are as follows.
1) In the AWDDQN algorithm, dual weighted estimators are integrated into the DDQN algorithm to overcome the shortcomings of misestimation of action reward values by DQN and DDQN algorithms. The voltage control objective is incorporated into the designed action reward values to form the AWDDQN-based intelligent voltage controller for the ADN, where renewable power generations, loads, and adjustable resources of EVs are time-varying.
2) The capability of EVAs is quantified as adjustable resources for voltage control using the schedulable capacity approach of EVAs, ensuring that EV charging demand is satisfied after participating in dispatch.
3) Comprehensive comparisons of voltage control methods based on the DQN, DDQN, AWDDQN, and traditional mixed-integer nonlinear programming (MINLP) algorithms are presented, and the impact of inaccurate estimation of action reward values on the performance of the DQN and DDQN algorithms is compared with that of the AWDDQN algorithm.
The remainder of this paper is organized as follows. Section II discusses the principle of the AWDDQN algorithm. The intelligent voltage control method based on AWDDQN algorithm is described in Section III. The effectiveness of the method is verified using digital simulation in Section IV, and the conclusions are drawn in Section V.
An agent selects an action to act on the environment according to a certain policy in a state of the environment. Then, the environment gives a reward for the action and steps to the next state. This process is usually described in reinforcement learning as a Markov decision process (MDP) model, represented by a tuple (S, A, P, r, γ), where S is the state space, A is the action space, P is the state transition probability, r is the reward, and γ is the discount rate. One of the representative methods to solve the MDP model is the Q-learning algorithm, in which the expected value of the action-value function Q(s, a) for action a taken in state s in an episode needs to be evaluated. According to the Bellman optimality equation, this function is described as:
$Q^{*}(s,a)=\mathbb{E}\big[r \mid s,a\big]+\gamma\sum_{s'}P(s' \mid s,a)\max_{a'}Q^{*}(s',a')$ (1)
where E[·] is the expectation operator; P(s'|s, a) is the probability of reaching state s' from state s after taking action a; and a' is the action in state s'.
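To make the tabular case concrete before moving to DQN, the following minimal Python sketch applies the Q-learning update implied by (1) to a toy problem; the state/action sizes, learning rate, and the sample transition are illustrative placeholders rather than anything from the paper.

```python
import numpy as np

# Toy tabular Q-learning update illustrating the Bellman-style target in (1).
# The state/action counts, learning rate, and discount factor are placeholders.
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example transition: from state 0, action 1 yields reward -0.2 and lands in state 2.
q_update(s=0, a=1, r=-0.2, s_next=2)
```

The table Q grows with the product of the state and action counts, which is exactly the limitation discussed next.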
The traditional Q-learning algorithm stores the action reward values of each state-action pair in a table and iteratively updates them continuously. However, storing and iterating the values of all state-action pairs are difficult when the state space is very large. To solve this problem, Q-learning algorithm is combined with deep learning to form the DQN algorithm [
The structure of DQN is shown in Fig. 1.

Fig. 1 Structure of DQN.
Its input is the current state st at the current time t, and the number of inputs equals the dimension of the state space NS. Its outputs are the approximate action reward values of all actions in state st, and the number of outputs equals the dimension of the action space NA. The agent selects the action with the largest reward value in the output layer.
The loss function in DQN is expressed as:
$L_i(\theta_i)=\mathbb{E}\big[(y_i-Q(s,a;\theta_i))^{2}\big]$ (2)
where yi and θi are the target function and the parameter of the online network Q at iteration i, respectively. The target function is defined as:
$y_i=r_i+\gamma Q(s',a_{\max};\theta_i^{-})$ (3)
where Q(s', a_max; θi⁻) is the maximum approximate value in the state s'; ri is the reward at iteration i; θi⁻ is the parameter of the target network; and a_max is the action in the DQN algorithm with the maximum approximate value in the state s', as shown in (4).
$a_{\max}=\arg\max_{a'}Q(s',a';\theta_i^{-})$ (4)
From (3) and (4), we can observe that the action in the DQN algorithm is both selected and evaluated by the same function Q(·; θi⁻). This makes it more likely to select overestimated values, resulting in overestimation of action reward values. To address this problem, the target function in the DDQN algorithm is modified as:
$y_i=r_i+\gamma Q(s',a^{*};\theta_i^{-})$ (5)
where a* is the action with the maximum approximate value given by the online network in the state s', as shown in (6).
$a^{*}=\arg\max_{a'}Q(s',a';\theta_i)$ (6)
From (5) and (6), we can observe that in the DDQN algorithm the action is selected by the online network Q(·; θi) and evaluated by the target network Q(·; θi⁻). The probability of overestimating an action is greatly reduced by this design of dual estimators.
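The difference between the targets in (3)-(4) and (5)-(6) can be illustrated with a short sketch that computes both from one sampled transition; the random vectors below merely stand in for the network outputs Q(s', ·; θi) and Q(s', ·; θi⁻) and are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99
r = 1.0                           # reward r_i of the sampled transition
q_online = rng.normal(size=8)     # stand-in for Q(s', .; theta_i)   (online network)
q_target = rng.normal(size=8)     # stand-in for Q(s', .; theta_i^-) (target network)

# DQN target (3)-(4): the target network both selects and evaluates the action,
# so maximization noise tends to inflate the estimate (overestimation).
a_dqn = int(np.argmax(q_target))
y_dqn = r + gamma * q_target[a_dqn]

# DDQN target (5)-(6): the online network selects the action and the target
# network evaluates it, decoupling selection from evaluation.
a_ddqn = int(np.argmax(q_online))
y_ddqn = r + gamma * q_target[a_ddqn]
```

Because the DDQN evaluation uses an action chosen by a different estimator, it no longer rides the maximum of its own noise, which is the mechanism behind the reduced overestimation noted above.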
However, the separation of selection and evaluation in DDQN algorithm sometimes creates underestimation problems, especially in environments with high stochasticity and uncertainty [
To overcome the above problems, the AWDDQN algorithm [
The structure of AWDDQN algorithm is similar to that of DQN algorithm, except the loss function and the target function. The loss function of the neural network in AWDDQN algorithm can be expressed as:
$L(\theta_i)=\mathbb{E}\big[(y_i-Q(s,a;\theta_i))^{2}\big]$ (7)
where yi is the target function in the AWDDQN algorithm. Dual weighted estimators are used in the target function, as shown in (8).
$y_i=r_i+\gamma\Big[\beta\,\dfrac{1}{F}\sum_{f=1}^{F}Q(s',a^{*};\theta_{i-f})+(1-\beta)\,\dfrac{1}{F}\sum_{f=1}^{F}Q(s',a^{*};\theta_{i-f}^{-})\Big]$ (8)

$a^{*}=\arg\max_{a'}\dfrac{1}{F}\sum_{f=1}^{F}Q(s',a';\theta_{i-f})$ (9)
where a* is the action with the maximum approximate value in the state s'; F is the memorized number of steps; θ_{i−f} and θ_{i−f}⁻ are the parameters of the approximate value functions of the online network and the target network at the fth past step, respectively; and β is the weight, as shown in (10).
$\beta=\dfrac{\big|Q(s',a^{*};\theta_i^{-})-Q(s',a_L;\theta_i^{-})\big|}{c+\big|Q(s',a^{*};\theta_i^{-})-Q(s',a_L;\theta_i^{-})\big|}$ (10)

$a_L=\arg\min_{a'}Q(s',a';\theta_i^{-})$ (11)
where aL is the action with the minimum approximate value in the state s'; and c is the hyperparameter for adjusting the weight values.
After the loss function is established, the stochastic gradient descent algorithm is used to update the network parameter θi, as shown in (12).
$\theta_{i+1}=\theta_i-\alpha\nabla_{\theta_i}L(\theta_i)$ (12)
where α is the gradient descent rate.
From (8), it can be observed that action selection and evaluation are separated in the AWDDQN algorithm, which avoids the possible overestimation of action reward values in the DQN algorithm. In addition, it averages the F action estimates learned in the past to form the target value, which avoids the possible misestimation of action reward values and the instability of the DDQN algorithm in stochastic environments. With these dual weighted estimators and the averaged weighted processing, better training stability and performance are obtained, and thus improved decision accuracy.
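A minimal Python sketch of the target computation in (8)-(11) is given below. The buffers of past online and target estimates, the default value of c, and the sample inputs are all illustrative assumptions; the sketch only shows how the F-step averaging and the weight β combine the two estimators.

```python
import numpy as np

def awddqn_target(r, gamma, q_online_hist, q_target_hist, c=1.0):
    """Sketch of the averaged weighted target in (8)-(11).

    q_online_hist, q_target_hist: arrays of shape (F, n_actions) holding the
    evaluations of Q(s', .) with the online and target parameters of the last
    F steps. Shapes and interfaces are illustrative.
    """
    q_online_avg = q_online_hist.mean(axis=0)   # (1/F) * sum_f Q(s', .; theta_{i-f})
    q_target_avg = q_target_hist.mean(axis=0)   # (1/F) * sum_f Q(s', .; theta_{i-f}^-)

    a_star = int(np.argmax(q_online_avg))       # (9): selection by the online estimator
    q_tgt = q_target_hist[-1]                   # most recent target-network estimate
    a_low = int(np.argmin(q_tgt))               # (11): least-valued action

    gap = abs(q_tgt[a_star] - q_tgt[a_low])
    beta = gap / (c + gap)                      # (10): weight between the two estimators

    # (8): weighted combination of the averaged online and target estimates.
    return r + gamma * (beta * q_online_avg[a_star] + (1 - beta) * q_target_avg[a_star])

# Example with F = 5 memorized evaluations over 8 discrete actions.
rng = np.random.default_rng(1)
y = awddqn_target(r=-0.3, gamma=0.99,
                  q_online_hist=rng.normal(size=(5, 8)),
                  q_target_hist=rng.normal(size=(5, 8)))
```

When the gap between the best and worst target-network values is large, β approaches 1 and the online estimate dominates; when the gap is small, the two estimators are balanced, which is intended to keep the target between the overestimation and underestimation extremes discussed above.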
In the proposed method, the node voltages and the EVA and adjustable resource states constitute the state space, the active and reactive outputs of resources constitute the action space, and the weighted voltage fluctuations constitute the reward. This enables the agent to learn the most favorable output action for voltage control under different states.
The objective function for the voltage control can be expressed as:
(13) |
where Un,t is the voltage of node n of the ADN at time t.
The constraints are as follows:
$U_{\min}\le U_{n,t}\le U_{\max}$ (14)
$P_{j,t}^{\min}\le P_{j,t}\le P_{j,t}^{\max}$ (15)

$Q_{j,t}^{\min}\le Q_{j,t}\le Q_{j,t}^{\max}$ (16)
where Umax and Umin are the allowable upper and lower voltage limits, respectively; Pj,t and Qj,t are the active and reactive power of the jth adjustable resource at time t, respectively; and the superscripts max and min denote their upper and lower limits.
In this paper, the adjustable reactive power resources considered are static reactive compensators with constant upper and lower limits. The adjustable active power resources considered are aggregated EVs, whose adjustable upper and lower limits are time-varying.
The capability of EVAs as adjustable resources is measured by the schedulable capacity (EVSC), which is the bidirectional energy and power exchanged by an EVA with the grid at time t without affecting the future use of the EVs [
The calculations of the schedulable charging capacity (SCC), schedulable charging power (SCP), schedulable discharging capacity (SDC), and schedulable discharging power (SDP) are shown in (17)-(20).
(17) |
(18) |
(19) |
(20) |
where Cd is the battery capacity of the EV; the remaining symbols denote the SCC, SDC, SCP, and SDP of the EV at time t, its maximum charging and discharging power, the charging and discharging efficiencies, the expected arrival and departure time, the dispatch step ts, the initial SOC of the EV on arrival, the SOC of the EV at time t, and the minimum required SOC when leaving.
The SCC, SDC, SCP, or SDP of an EVA is the sum of the SCC, SDC, SCP, or SDP of EVs connected to this aggregator at time t, as in (21)-(24).
(21) |
(22) |
(23) |
(24) |
where the summation is taken over the total number of EVs connected to the EVA at time t.
Equations (17)-(24) quantify the time-varying schedulable capacity and power of each EVA, which serve as the adjustable active power limits used in the voltage control.
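Since the per-EV expressions in (17)-(20) are not reproduced here, the sketch below uses a deliberately simplified notion of the four quantities (energy headroom to a full battery, dischargeable energy above what must remain to reach the departure SOC, and power bounded by the charger rating), with efficiencies omitted; the aggregation in (21)-(24) is then a plain sum over the EVs connected to the EVA.

```python
from dataclasses import dataclass

@dataclass
class EV:
    capacity_kwh: float        # battery capacity C_d
    soc: float                 # SOC at time t
    soc_req: float             # minimum SOC required at departure
    p_charge_max: float        # maximum charging power (kW)
    p_discharge_max: float     # maximum discharging power (kW)
    hours_to_departure: float  # remaining plug-in time

def schedulable_capacity(ev: EV):
    """Simplified SCC, SDC, SCP, SDP of one EV (illustrative, not (17)-(20))."""
    scc = ev.capacity_kwh * max(1.0 - ev.soc, 0.0)
    # Energy that must stay in (or still be charged into) the battery so the EV
    # can reach soc_req before leaving, given its maximum charging power.
    reserved = max(ev.soc_req * ev.capacity_kwh
                   - ev.p_charge_max * ev.hours_to_departure, 0.0)
    sdc = max(ev.soc * ev.capacity_kwh - reserved, 0.0)
    scp = ev.p_charge_max if scc > 0.0 else 0.0
    sdp = ev.p_discharge_max if sdc > 0.0 else 0.0
    return scc, sdc, scp, sdp

def eva_capacity(evs):
    """EVA-level SCC, SDC, SCP, SDP: sums over connected EVs, as in (21)-(24)."""
    sums = [0.0, 0.0, 0.0, 0.0]
    for ev in evs:
        for k, v in enumerate(schedulable_capacity(ev)):
            sums[k] += v
    return tuple(sums)

fleet = [EV(60.0, 0.5, 0.8, 7.0, 7.0, 6.0), EV(40.0, 0.9, 0.6, 11.0, 11.0, 2.0)]
scc, sdc, scp, sdp = eva_capacity(fleet)
```

Because the departure time and required SOC enter the per-EV terms, the aggregated values shrink automatically as EVs approach departure, which is what makes the EVA limits time-varying.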
In reinforcement learning, the objective function and constraints of voltage control are normalized to form an MDP model.
In the proposed method, the state space S is the set of all states, and the state set st is the set of states of node voltages and adjustable resources of the ADN at time t, as shown in (25).
(25) |
where the elements of st are the node voltages, the EVSC of each EVA, and the output active or reactive power of each adjustable resource of the ADN at time t.
These data can be obtained by direct state measurement from the system devices in real systems. In this paper, the states are obtained by power flow calculations after the actions are applied, to simulate the operation of a real system. Specifically, the state space is divided into three parts, i.e., all node voltage values, the EVSC of all dispatchable EVAs, and the output power of all adjustable resources. If the number of adjustable EVAs is denoted by L and the number of adjustable resources by J, the dimension of the state space NS is the total number of these quantities.
The action space A includes all output cases of all adjustable resources. In this paper, the output cases are represented by the discretized outputs of the adjustable resources. Specifically, the action set at set for the state set st is a certain output action set of all adjustable resources, as shown in (26).
(26) |
where each element of at is the discretized output action of the corresponding adjustable resource at time t.
The set of actions can be converted into the output power of all adjustable resources, thus allowing the action space and the state space to be linked, with the conversion formula, as shown in (27).
(27) |
As can be observed from the design of the action space, the output ranges of the adjustable resources should not exceed their limits so that the constraints (15) and (16) are satisfied.
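The discretization described above can be sketched as follows; the mapping in (26)-(27) is not reproduced here, so the base-K decoding, the value K = 8, and the example limits are illustrative assumptions consistent with the discrete action space used later.

```python
import numpy as np

def decode_action(action_index, limits, K=8):
    """Map a flat action index in [0, K**J) to one set-point per resource.

    limits: list of (p_min, p_max) pairs for the J adjustable resources; for an
    EVA these bounds would come from its time-varying EVSC, while for a static
    var compensator they are constant. The decoding itself is an assumption.
    """
    setpoints = []
    for j, (p_min, p_max) in enumerate(limits):
        level = (action_index // K**j) % K                    # digit j of the base-K index
        setpoints.append(p_min + level * (p_max - p_min) / (K - 1))
    return np.array(setpoints)

# Example: two EVAs (active power, MW) and one reactive power resource (Mvar),
# giving K**3 = 512 possible actions for K = 8.
limits = [(-0.8, 1.2), (-0.6, 0.9), (-0.5, 0.5)]
p_out = decode_action(action_index=137, limits=limits, K=8)
```

Because the decoded set-points always lie inside the supplied limits, constraints (15) and (16) are satisfied by construction, which matches the statement above.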
In addition, the action for EVs will be weighted according to the EVSC values of each EV, as shown in (28).
(28) |
where the weighted value is the dispatched charging (or discharging) power of the EV at time t.
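The text states that the EVA-level action is weighted across its EVs according to their EVSC values; a proportional split is one natural reading, sketched below since (28) itself is not legible in this extract.

```python
def allocate_eva_power(p_eva, ev_weights):
    """Split the dispatched EVA power among EVs in proportion to their
    schedulable capacity values (proportional weighting is an assumption;
    the exact rule is given by (28) in the paper).

    p_eva: power dispatched to the aggregator at time t (charging > 0).
    ev_weights: per-EV schedulable capacity values used as weights.
    """
    total = sum(ev_weights)
    if total <= 0:
        return [0.0 for _ in ev_weights]
    return [p_eva * w / total for w in ev_weights]

# Example: 100 kW of charging spread over three EVs with unequal headroom.
p_ev = allocate_eva_power(100.0, [10.0, 30.0, 60.0])   # -> [10.0, 30.0, 60.0] kW
```

Allocating in proportion to the schedulable values means no EV is asked for more than its own headroom as long as the total dispatch respects the EVA limits in (21)-(24), so the charging demand of each EV user is not violated.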
The reward rt obtained by the agent after selecting an action according to the node voltage states at time t is designed to be negative; thus, a higher reward means a smaller voltage volatility rate. The objective function can be obtained by accumulating the rewards, and this design incorporates the objective function (13) into the rewards, as shown in (29) and (30).
(29) |
(30) |
where the penalty factor weights the voltage deviation of each node.
As can be observed from (29) and (30), the penalty factor provides a certain relaxation of constraint (14), and it is larger for nodes with larger voltage deviations. This allows the agent to prioritize the scheduling of nodes with larger voltage deviations in resource-limited scenarios to prevent the voltages from exceeding their limits, and thus to satisfy constraint (14).
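Since (29)-(30) are not legible in this extract, the sketch below only captures the stated design: a negative reward built from weighted voltage deviations, with an extra penalty for nodes outside the allowable band; the specific weighting and the penalty value of 10 are assumptions.

```python
import numpy as np

def voltage_reward(voltages, u_min=0.95, u_max=1.05, penalty=10.0):
    """Illustrative reward: negative weighted voltage deviation.

    Deviations from 1.0 p.u. are always penalized, and nodes outside
    [u_min, u_max] receive an extra penalty so the agent prioritizes them.
    The exact expressions of (29)-(30) may differ.
    """
    u = np.asarray(voltages, dtype=float)
    deviation = np.abs(u - 1.0)
    violation = np.clip(u_min - u, 0.0, None) + np.clip(u - u_max, 0.0, None)
    return -float(np.sum(deviation + penalty * violation))

r_t = voltage_reward([0.97, 1.06, 0.94, 1.00])   # the violating nodes dominate the reward
```

Because violations are penalized much more heavily than in-band deviations, the accumulated reward tracks the voltage volatility objective while still steering the agent away from limit violations first.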
A dynamic ε-greedy strategy is used during the agent training. At each iteration, the agent selects a random action with probability ε or selects the action with the maximum reward value with probability 1−ε, while the value of ε changes over the training. In this paper, a dynamic greedy strategy combined with a simulated annealing method is used [
(31) |
where εk is the value of ε at the kth episode; δ is the cooling-down factor, which is a constant greater than 0 and lower than 1; T0 is the initial temperature of simulated annealing; and ar is a uniformly and randomly selected action in state s.
With the dynamic ε-greedy strategy combined with the simulated annealing method, the goal of exploring first and then converging is achieved.
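Equation (31) is not reproduced here, so the following sketch only illustrates one common way to combine ε-greedy action selection with an annealing-style cooling schedule driven by the cooling factor δ and initial temperature T0 named above; the exact schedule in (31) may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

def epsilon(episode, delta=0.98, T0=1.0, eps_min=0.01):
    """Exploration rate cooled geometrically with the episode index
    (an assumed schedule standing in for (31))."""
    return max(eps_min, T0 * delta**episode)

def select_action(q_values, episode):
    """Dynamic epsilon-greedy: random action with probability epsilon,
    greedy action with the largest estimated reward otherwise."""
    if rng.random() < epsilon(episode):
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

Early episodes explore almost uniformly, and as δ^k shrinks the agent increasingly exploits its learned values, matching the explore-then-converge behavior described above.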
The AWDDQN-based intelligent voltage control process includes an offline training process and an online voltage control process, as shown in Algorithm 1. The purpose of the offline training process is to teach the AWDDQN-based intelligent voltage control agent how to pick the optimal action for different states.
The AWDDQN-based intelligent voltage control agent is used as an online agent to control the real-time voltages of the ADN after the offline training is completed. The online voltage control process using the AWDDQN-based intelligent voltage control agent is shown in Fig. 2.

Fig. 2 Online voltage control process using AWDDQN-based intelligent voltage control agent.
Algorithm 1: offline training of the AWDDQN-based intelligent voltage control agent

1. Input: the maximum iteration I, memory buffer size M, mini-batch size B, number of steps in an episode T, number of copy steps C, initial state s0, c, γ, F, α, K, δ, and T0
2. Initialize: neural network structure, memory replay buffer, parameters θ0 and θ0⁻
3. For i = 0 to I, do
4. Initialize state s0
5. For t = 0 to T, do
6. Select action at based on the dynamic ε-greedy strategy and obtain the output power of the adjustable resources by (26)-(28)
7. Change the distribution network and EVs according to the power output of the adjustable resources
8. Obtain the new state st+1 by (25)
9. Calculate the reward rt by (29) and (30)
10. Store the tuple {st, at, st+1, rt} in the memory replay buffer
11. if the memory replay buffer is full
12. Delete the earliest tuple
13. end if
14. Sample a mini-batch of B tuples from the memory replay buffer and feed it to the target and online networks
15. Obtain the approximate action reward values from the online network
16. Calculate the target value y by (8)-(11)
17. Calculate L(θi) by (7)
18. Update θi by (12)
19. Input st+1 to the online network
20. if the copy interval of C steps is reached
21. Copy θi+1 to θ⁻
22. end if
23. end for
24. end for
25. Output θ
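A compact Python skeleton of this training loop is given below; the environment and agent interfaces (reset, step, select_action, update, sync_target) are placeholders for the power flow simulation, the ε-greedy policy, the loss (7) with target (8)-(11), and the parameter copy, and are not part of the paper.

```python
import random
from collections import deque

def train(env, agent, n_episodes=3000, T=96, M=10000, B=200, C=200):
    """Skeleton of the offline training process in Algorithm 1 (interfaces assumed).

    env   : reset() -> state, step(action) -> (next_state, reward); internally it
            runs the power flow and evaluates the reward of (29)-(30).
    agent : select_action(state, episode) implements the dynamic eps-greedy policy,
            update(batch) applies the loss (7) and the parameter update (12),
            sync_target() copies the online parameters to the target network.
    """
    replay = deque(maxlen=M)          # oldest tuples are dropped automatically
    step_count = 0
    for episode in range(n_episodes):
        state = env.reset()
        for t in range(T):
            action = agent.select_action(state, episode)
            next_state, reward = env.step(action)
            replay.append((state, action, next_state, reward))
            if len(replay) >= B:
                agent.update(random.sample(list(replay), B))
            state = next_state
            step_count += 1
            if step_count % C == 0:   # copy the online parameters every C steps
                agent.sync_target()
    return agent
```

Training is performed entirely offline on simulated days, so the many power flow evaluations never have to run on the real feeder.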
The trained AWDDQN-based intelligent voltage control agent can decide the optimal output actions of the adjustable resources based on the system state information formed by node voltages, EVSC states, and other unit states.
In practice, the control procedure of EVs consists of two-level hierarchical control. In the upper-level control, the distribution system operator equipped with AWDDQN-based intelligent voltage control agent achieves voltage optimization by coordinating the schedulable capacity needed among the EVAs. In the lower-level control, each EVA coordinates the control of EV charging and discharging behaviors according to the negotiated schedulable capacity in the upper-level control. After receiving the action signal in the upper-level control, the EVAs coordinate the power of EVs by (28) in the lower-level control. Similarly, each reactive unit changes its output power so that the action is completed. In this way, the ADN voltages, the EVSC of EVAs, and the output power of other adjustable resources are updated to new states. Thus, the AWDDQN-based intelligent voltage control agent continuously selects the actions based on the new states. During the online control, the AWDDQN-based intelligent voltage control agent can handle the time-varying environment in real time through observing the states measured and the rewards received after the actual control actions.
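The online stage described above can be summarized by a short loop; all of the interfaces below (grid measurement, EVA and reactive unit controllers, and the action decoding) are illustrative placeholders, not APIs from the paper.

```python
def online_control(agent, grid, eva_controllers, svc_controllers, decode_action, limits):
    """Illustrative online control loop for the trained agent (interfaces assumed)."""
    while True:
        state = grid.measure_state()              # node voltages, EVSCs, unit outputs
        action = agent.greedy_action(state)       # no exploration during online control
        setpoints = decode_action(action, limits) # per-resource set-points
        n_eva = len(eva_controllers)
        for eva, p in zip(eva_controllers, setpoints[:n_eva]):
            eva.dispatch(p)                       # each EVA splits p among its EVs, cf. (28)
        for svc, q in zip(svc_controllers, setpoints[n_eva:]):
            svc.set_reactive_power(q)
        grid.wait_for_next_step()                 # 15 min dispatch interval
```

The agent only needs the measured state at each step, so it reacts to the time-varying EVSC and load conditions without re-solving an optimization model online.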
Modified versions of two typical distribution systems, i.e., the IEEE 33-bus and the IEEE 123-bus systems, are used as test cases. The settings of the power resources and adjustable resources are shown in the table below.
Testing system | Connected node | Resource type | Power setting
---|---|---|---
IEEE 33-bus | 8, 25 | PV generators | 1.5 MW
 | 15 | WT units | 1.5 MW
 | 18, 23 | EVAs | Shown in Fig. 3(c)
 | 30 | Reactive power resources | [-0.5 Mvar, 0.5 Mvar]
IEEE 123-bus | 20, 60, 86 | PV generators | 1.5 MW
 | 30, 68 | WT units | 1 MW
 | 29, 42, 52, 85 | EVAs | Shown in Fig. 3(c)
 | 71, 109 | Reactive power resources | [-0.5 Mvar, 0.5 Mvar]
The output power profiles of the WT units and PV generators, the load profiles, and the EVSC profiles of the EVAs are shown in Fig. 3.

Fig. 3 System information of testing system. (a) Output power profiles of WT units and PV generators. (b) Load profiles. (c) EVSC profiles of EVAs.
From Fig. 3, the outputs of the RESs, the loads, and the EVSCs of the EVAs are all strongly time-varying, which makes the voltage control environment dynamic.
In the IEEE 33-bus system, the number of nodes N is 33, the number of EVAs L is 2, and the number of adjustable resources J is 3, so the dimension of the state space NS is 44. The total number of adjustable actions per resource is K = 8, so the dimension of the action space NA is 8^3 = 512. Therefore, the numbers of inputs and outputs to the neural network are 44 and 512, respectively. Similarly, in the IEEE 123-bus system, the numbers of inputs and outputs to the neural network are 140 and 32768, respectively.
The 400-day data for the output power of the RESs, the loads, and the EVSCs are generated by using a Monte Carlo method with uniformly random fluctuations around the base values. The step interval ts is 15 min. The same method is used to generate one day of data as the test set. The training parameters of the AWDDQN algorithm are given in the table below.
(32) |
Parameter | IEEE 33-bus system | IEEE 123-bus system
---|---|---
Activation function | ReLU | ReLU
c | 1 | 1
γ | 0.99 | 0.99
α | 0.001 | 0.001
F | 5 | 4
M | 10000 | 20000
B | 200 | 200
C | 200 | 200
I | 288000 | 960000
δ | 0.98 | 0.99
T0 | 1 | 1
In this paper, 24 hours are taken as one episode, so the number of steps in an episode T is 96 with the 15 min step interval. The calculation of rt is shown in (29) and (30).
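As noted above, the training data are produced by Monte Carlo sampling with uniformly random fluctuations around the base profiles; a minimal sketch is shown below, where the ±10% fluctuation range, the synthetic base curve, and the array shapes are illustrative assumptions.

```python
import numpy as np

def generate_training_days(base_profile, n_days=400, fluctuation=0.1, seed=0):
    """Scale each 15-min point of a base profile by a uniform random factor in
    [1 - fluctuation, 1 + fluctuation] to create n_days of training profiles.
    The fluctuation range is an assumption, not the paper's setting."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base_profile, dtype=float)          # shape (96,) for 15-min steps
    factors = rng.uniform(1 - fluctuation, 1 + fluctuation,
                          size=(n_days, base.size))
    return factors * base

# Synthetic PV-like base curve over 96 quarter-hour steps, then 400 sampled days.
base_pv = np.maximum(0.0, np.sin(np.linspace(0.0, np.pi, 96)))
pv_days = generate_training_days(base_pv, n_days=400)
```

The same sampling applied to the load and EVSC base profiles yields the full 400-day training set, while one additional sampled day serves as the test set.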
The simulation results of the voltage control methods with the DQN, DDQN, and AWDDQN algorithms are obtained and compared using a computer with a 3.0 GHz Intel Core i7 CPU, 16 GB RAM, and a GTX 2070 graphics card in the MATLAB environment. The training parameters of the DQN and DDQN algorithms are the same as those of the AWDDQN algorithm. The episode rewards of the three algorithms for the modified IEEE 33-bus and IEEE 123-bus systems are shown in Fig. 4.

Fig. 4 Episode rewards of DQN, DDQN, and AWDDQN algorithms. (a) IEEE 33-bus system. (b) IEEE 123-bus system.
It can be observed from Fig. 4 that the episode rewards of the AWDDQN algorithm converge to higher values than those of the DQN and DDQN algorithms in both systems.
In the action space, K is the total number of actions per adjustable resource, i.e., the adjustable range of each adjustable resource is divided equally into K values; the larger K is, the more precisely the resources can be adjusted, and vice versa. Besides, from (9), we can observe that the action is obtained by traversing all actions a', thus an increase in the dimension of the action space delays the selection of the action. The number of dimensions in the action space is K^J, which grows exponentially with K. The training processes of the AWDDQN algorithm with different K values in the IEEE 33-bus system are shown in Fig. 5.

Fig. 5 Training process of AWDDQN algorithm with different K values in IEEE 33-bus system.
According to the above settings for the modified IEEE 33-bus and 123-bus systems, simulations are performed using the voltage control methods equipped with the DQN, DDQN, and AWDDQN algorithms, respectively. In addition, to demonstrate the advantages of the AWDDQN algorithm over traditional algorithms, the MINLP-based algorithm in [ is simulated as a benchmark. The node voltages of the two systems without voltage control are shown in Fig. 6.

Fig. 6 Node voltages without control. (a) Voltages of IEEE 33-bus system. (b) Voltages of IEEE 123-bus system.
The reasonable range of voltage is [0.95, 1.05] p.u. In the IEEE 33-bus system, most node voltages rise significantly at noon due to the increased output power of the PVs, with a significant violation of the upper limit near node 25. At night, due to the absence of PV power and the increase of loads, there is a significant voltage drop at the end nodes such as node 33, where the voltage even falls below the lower limit. Similarly, there are node voltages outside the limits in the IEEE 123-bus system, such as the voltage of node 86.
Figures 7 and 8 show the results of the voltage control methods with the MINLP, DQN, DDQN, and AWDDQN algorithms in the IEEE 33-bus and IEEE 123-bus systems, respectively.

Fig. 7 Results of voltage control methods with different algorithms in IEEE 33-bus system. (a) MINLP algorithm. (b) DQN algorithm. (c) DDQN algorithm. (d) AWDDQN algorithm.

Fig. 8 Results of voltage control methods with different algorithms in IEEE 123-bus system. (a) MINLP algorithm. (b) DQN algorithm. (c) DDQN algorithm. (d) AWDDQN algorithm.
It can be observed from Figs. 7 and 8 that all four methods markedly reduce the voltage deviations compared with the uncontrolled case in Fig. 6. The optimal action results of the adjustable reactive power resources and the corresponding node voltage profiles obtained with the DRL-based methods are shown in Fig. 9.

Fig. 9 Optimal action results of adjustable reactive power resources and node voltage profiles of voltage control methods with different DRL algorithms. (a) IEEE 33-bus system. (b) IEEE 123-bus system.
The results of the two cases show that the control method with the AWDDQN algorithm can select the actions with higher rewards and thus obtain better control results compared with those with the DQN and DDQN algorithms. This proves that the control method with the AWDDQN algorithm is more accurate in evaluating the action reward values.
Testing system | Case | Objective function value | Voltage range (p.u.) | Calculation time (s)
---|---|---|---|---
IEEE 33-bus | Initial state | 0.0248 | [0.924, 1.072] | 
 | MINLP | 0.0092 | [0.962, 1.053] | 213.45
 | DQN | 0.0103 | [0.948, 1.060] | 0.16
 | DDQN | 0.0080 | [0.950, 1.051] | 0.17
 | AWDDQN | 0.0072 | [0.955, 1.020] | 0.23
IEEE 123-bus | Initial state | 0.0358 | [0.919, 1.055] | 
 | MINLP | 0.0046 | [0.965, 1.045] | 1026.45
 | DQN | 0.0057 | [0.947, 1.048] | 0.57
 | DDQN | 0.0030 | [0.974, 1.044] | 0.61
 | AWDDQN | 0.0025 | [0.975, 1.040] | 0.95
From the table above, the AWDDQN-based method obtains the smallest objective function values and keeps the node voltages within the allowable range in both systems, and its calculation time, although slightly longer than those of the DQN and DDQN based methods, is far shorter than that of the MINLP-based method. The SOC curves of the EVs in EVA 1 after scheduling by the voltage control method with the AWDDQN algorithm are shown in Fig. 10.

Fig. 10 SOC curves of EVA 1 after scheduling by voltage control method with AWDDQN algorithm. (a) IEEE 33-bus system. (b) IEEE 123-bus system.
In this paper, an intelligent voltage control method for ADNs based on AWDDQN algorithm is proposed. Using this method, the agent can intelligently control the adjustable active and reactive power resources according to the states of the ADN. The main conclusions are as follows.
1) The AWDDQN algorithm can intelligently coordinate and control the reactive power of the resources and the active power of the EVs for ADN voltage control, eliminating the voltage limit violations. The objective function is improved from 0.0248 to 0.0072 in the IEEE 33-bus system and from 0.0358 to 0.0025 in the IEEE 123-bus system, respectively, without affecting the charging demands of the EV users. These results indicate that the performance of the AWDDQN algorithm is unaffected by the scale of the systems.
2) The proposed method computes much faster than the method with the traditional MINLP algorithm.
3) The simulation results for the IEEE 33-bus and IEEE 123-bus systems indicate the problems of overestimation in the voltage control method with the DQN algorithm and the underestimation in that with the DDQN algorithm.
4) The proposed method with the AWDDQN algorithm can overcome these shortcomings by introducing the average weighted estimators, resulting in better evaluation of the action reward values and better reward convergence values.
5) The more complex design of the AWDDQN target increases the calculation time, but only to an acceptable level.
References
H. Zhou, S. Chen, J. Lai et al., “Modeling and synchronization stability of low-voltage active distribution networks with large-scale distributed generations,” IEEE Access, vol. 6, pp. 70989-71002, Nov. 2018.

S. Xia, S. Bu, C. Wan et al., “A fully distributed hierarchical control framework for coordinated operation of DERs in active distribution power networks,” IEEE Transactions on Power Systems, vol. 34, no. 6, pp. 5184-5197, Nov. 2019.

C. Sarimuthu, V. Ramachandaramurthy, K. Agileswari et al., “A review on voltage control methods using on-load tap changer transformers for networks with renewable energy sources,” Renewable and Sustainable Energy Reviews, vol. 62, no. 1, pp. 1154-1161, Sept. 2016.

Z. Liu and X. Guo, “Control strategy optimization of voltage source converter connected to various types of AC systems,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 1, pp. 77-84, Jan. 2021.

R. A. Jabr, “Power flow based volt/var optimization under uncertainty,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 5, pp. 1000-1006, Sept. 2021.

H. Li, M. A. Azzouz, and A. A. Hamad, “Cooperative voltage control in MV distribution networks with electric vehicle charging stations and photovoltaic DGs,” IEEE Systems Journal, vol. 15, no. 2, pp. 2989-3000, Jun. 2020.

Y. Zheng, Y. Song, D. J. Hill et al., “Online distributed MPC-based optimal scheduling for EV charging stations in distribution systems,” IEEE Transactions on Industrial Informatics, vol. 15, no. 2, pp. 638-649, Feb. 2019.

M. Mazumder and S. Debbarma, “EV charging stations with a provision of V2G and voltage support in a distribution network,” IEEE Systems Journal, vol. 15, no. 1, pp. 662-671, Mar. 2021.

A. Ahmadian, B. Mohammadi-Ivatloo, and A. Elkamel, “A review on plug-in electric vehicles: introduction, current status, and load modeling techniques,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 3, pp. 412-425, May 2020.

H. Patil and V. N. Kalkhambkar, “Grid integration of electric vehicles for economic benefits: a review,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 1, pp. 13-26, Jan. 2021.

X. Sun and J. Qiu, “Hierarchical voltage control strategy in distribution networks considering customized charging navigation of electric vehicles,” IEEE Transactions on Smart Grid, vol. 12, no. 6, pp. 4752-4764, Nov. 2021.

Y. Wang, T. John, and B. Xiong, “A two-level coordinated voltage control scheme of electric vehicle chargers in low-voltage distribution networks,” Electric Power Systems Research, vol. 168, no. 1, pp. 218-227, Mar. 2018.

Y. Liu and H. Liang, “A discounted stochastic multiplayer game approach for vehicle-to-grid voltage regulation,” IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9647-9659, Oct. 2019.

J. Hu, C. Ye, Y. Ding et al., “A distributed MPC to exploit reactive power V2G for real-time voltage regulation in distribution networks,” IEEE Transactions on Smart Grid, vol. 13, no. 1, pp. 576-588, Jan. 2022.

Y. Zhang, X. Wang, J. Wang et al., “Deep reinforcement learning based volt-var optimization in smart distribution systems,” IEEE Transactions on Smart Grid, vol. 12, no. 1, pp. 361-371, Jan. 2021.

H. Diao, M. Yang, F. Chen et al., “Reactive power and voltage optimization control approach of the regional power grid based on reinforcement learning theory,” Transactions of China Electrotechnical Society, vol. 30, no. 12, pp. 408-414, Jun. 2015.

D. Cao, W. Hu, J. Zhao et al., “Reinforcement learning and its applications in modern power and energy systems: a review,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029-1042, Dec. 2020.

V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.

J. Shi, W. Zhou, N. Zhang et al., “Deep reinforcement learning algorithm of voltage regulation in distribution network with energy storage system,” Electric Power Construction, vol. 3, pp. 1-8, Mar. 2020.

R. Diao, Z. Wang, D. Shi et al., “Autonomous voltage control for grid operation using deep reinforcement learning,” in Proceedings of 2019 IEEE PES General Meeting (PESGM), Atlanta, USA, Aug. 2019, pp. 1-5.

Q. Yang, G. Wang, A. Sadeghi et al., “Two-timescale voltage control in distribution grids using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2313-2323, May 2020.

X. Sun and J. Qiu, “A customized voltage control strategy for electric vehicles in distribution networks with reinforcement learning method,” IEEE Transactions on Industrial Informatics, vol. 17, no. 10, pp. 6852-6863, Oct. 2021.

H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, pp. 1-13, Mar. 2016.

O. Lukianykhin and T. Bogodorova, “Voltage control-based ancillary service using deep reinforcement learning,” Energies, vol. 14, no. 8, pp. 1-22, Apr. 2021.

Z. Zhang, Z. Pan, and M. J. Kochenderfer, “Weighted double Q-learning,” in Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne, Australia, Aug. 2018, pp. 3455-3461.

J. Wu, Q. Liu, S. Chen et al., “Averaged weighted double deep Q-network,” Journal of Computer Research and Development, vol. 57, no. 3, pp. 576-589, Jun. 2020.

H. Zhang, Z. Hu, Z. Xu et al., “Evaluation of achievable vehicle-to-grid capacity using aggregate PEV model,” IEEE Transactions on Power Systems, vol. 32, no. 1, pp. 784-794, Jan. 2017.

H. Liang, Z. Lee, and G. Li, “A calculation model of charge and discharge capacity of electric vehicle cluster based on trip chain,” IEEE Access, vol. 8, pp. 142026-142042, Aug. 2020.

F. D. Kanellos, “Optimal scheduling and real-time operation of distribution networks with high penetration of plug-in electric vehicles,” IEEE Systems Journal, vol. 15, no. 3, pp. 3938-3947, Sept. 2021.

R. Su, F. Wu, and J. Zhao, “Deep reinforcement learning method based on DDPG with simulated annealing for satellite attitude control system,” in Proceedings of 2019 Chinese Automation Congress (CAC), Hangzhou, China, Nov. 2019, pp. 390-395.

Z. Wang, J. Wang, B. Chen et al., “MPC-based voltage/var optimization for distribution circuits with distributed generators and exponential load models,” IEEE Transactions on Smart Grid, vol. 5, no. 5, pp. 2412-2420, Sept. 2014.