Abstract
The optimal dispatch methods of integrated energy systems (IESs) currently struggle to address the uncertainties resulting from renewable energy generation and energy demand. Moreover, the increasing intensity of the greenhouse effect renders the reduction of IES carbon emissions a priority. To address these issues, a deep reinforcement learning (DRL)-based method is proposed to optimize the low-carbon economic dispatch model of an electricity-heat-gas IES. In the DRL framework, the optimal dispatch model of the IES is formulated as a Markov decision process (MDP). A reward function based on the reward-penalty ladder-type carbon trading mechanism (RPLT-CTM) is introduced to enable the DRL agents to learn more effective dispatch strategies. Moreover, a distributed proximal policy optimization (DPPO) algorithm, which is a novel policy-based DRL algorithm, is employed to train the DRL agents. The multithreaded architecture enhances the exploration ability of the DRL agents in complex environments. Experimental results illustrate that the proposed DPPO-based IES dispatch method can mitigate carbon emissions and reduce the total economic cost. The RPLT-CTM-based reward function outperforms the CTM-based methods, providing a 4.42% and 6.41% decrease in operating cost and carbon emission, respectively. Furthermore, the superiority and computational efficiency of DPPO compared with other DRL-based methods are demonstrated by a decrease of more than 1.53% and 3.23% in the operating cost and carbon emissions of the IES, respectively.
The limitations of traditional energy sources and the diversity of human needs pose considerable challenges to current energy structures [
Recently, research on the economic dispatch (ED) of IESs has received increasing attention. However, the fluctuation and randomness of renewable energy and load represent a source of uncertainty, thus complicating the solution to the ED problem for IESs [
A relevant aspect to consider in the development of IESs is global warming, which is caused by the emission of greenhouse gases with CO2 as the main component [
Traditional dispatch methods are based on day-ahead forecasting information. However, these methods do not consider uncertainties of load demand and renewable energy generation. Mathematical programming-based methods have been developed to solve ED problems while considering these uncertainties. Reference [
However, these dispatch methods have certain limitations. Scenario-based SO may require the generation of several scenarios based on probability distributions, resulting in a severe increase in computational burden. More importantly, the optimal dispatch results may not satisfy the constraints of scenarios that are not considered [
Control theory-based methods such as model predictive control (MPC) have also been used to address uncertainties in the optimal operation problem. Reference [
Reference | Method | Description |
---|---|---|
[ | SO | Many scenarios need to be generated. A severe computational burden may be incurred. The optimal dispatch results may not satisfy the constraints of scenarios that are not considered. |
[ | RO | The results are conservative because the worst case of uncertainty is considered. |
[ | SO-RO | The operating cost and reliability of the system are considered. Appropriate scenarios are required. |
[ | DRO | The advantages of SO and RO are combined. The modeling and solving processes are complex. |
[ | MPC | Rolling optimization is applied to offset uncertainty. The process is complicated, and the optimization quality relies on the forecast accuracy of uncertain variables. |
[ | IGDT | The choice of some coefficients is subjective. |
[ | DRL | Instead of relying on prior knowledge, the agent collects data by interacting with the environment and learning from data. The agent can be applied to real-time dispatch after offline training. |
In contrast to the aforementioned methods, the DRL agent collects data by interacting with the IES environment and learns a dispatch strategy from the data. In some studies, DRL algorithms have been applied in discrete action spaces to solve optimal dispatch problems that consider uncertainties in microgrids [
In [
To satisfy the energy demands of an IES and minimize operating costs and pollutant emissions, [
Reference [
Several studies have attempted to introduce the CTM into DRL-based frameworks. Reference [
Most studies applying DRL methods to solve the optimal dispatch problem while accounting for uncertainties have not considered the carbon emissions of the system. Only a few studies have considered carbon emissions by introducing a traditional CTM-based reward function to obtain a low-carbon ED model for the IES. However, as the reward function affects the effectiveness of the strategy learned by the agent, it should be carefully designed within the DRL framework. Moreover, the introduction of CTM increases the complexity of the DRL environment. Hence, a more efficient algorithm is required for the agent to learn low-carbon ED strategies.
To address the existing research gap, a DRL-based dynamic energy dispatch method is proposed for the low-carbon economic operation of an electricity-heat-gas IES. A comparison of the elements considered in the development of our model and those presented in the reviewed models is presented in the following table.
Reference | Action space: Discrete | Action space: Continuous | Energy: Electricity | Energy: Heat | Energy: Gas | Dispatch: Economy | Dispatch: Emission | CTM: Traditional | CTM: Ladder-type
---|---|---|---|---|---|---|---|---|---
[ | √ | √ | √ | ||||||
[ | √ | √ | √ | √ | |||||
[ | √ | √ | √ | ||||||
[ | √ | √ | √ | √ | |||||
[ | √ | √ | √ | √ | √ | ||||
[ | √ | √ | √ | √ | √ | ||||
[ | √ | √ | √ | √ | √ | ||||
[ | √ | √ | √ | √ | √ | ||||
[ | √ | √ | √ | √ | √ | √ | |||
[ | √ | √ | √ | √ | √ | ||||
[ | √ | √ | √ | √ | √ | ||||
[ | √ | √ | √ | √ | √ | √ | |||
Proposed model | √ | √ | √ | √ | √ | √ | √ |
To achieve low-carbon operation of the system, a reward-penalty ladder-type CTM (RPLT-CTM) is introduced into the DRL framework. The RPLT-CTM better reflects the incentives that guide enterprises to reduce emissions. For this reason, we use an RPLT-CTM-based reward function with variable carbon trading prices to guide the agent more effectively in learning the low-carbon economic scheduling strategy for the IES. Moreover, to solve the optimal scheduling problem, the distributed proximal policy optimization (DPPO) algorithm is introduced, which is a policy-based DRL algorithm that is less sensitive to hyperparameters and can avoid large policy updates with undesirable action selections.
The major contributions can be summarized as follows.
1) A DRL-based method for low-carbon ED of an electricity-heat-gas IES, which considers economics and carbon emissions, is established. The low-carbon ED is mathematically modeled as a Markov decision process (MDP).
2) The RPLT-CTM is introduced into the DRL framework to realize low-carbon ED. Compared with the traditional CTM, the RPLT-CTM-based reward function has been proven to guide the DRL agent in formulating an improved low-carbon ED strategy.
3) To address the increased complexity introduced by the low-carbon objective, the DPPO algorithm with a distributed architecture is introduced to train the DRL agent. A comparative analysis demonstrates the computational effectiveness and superiority of this algorithm.
The remainder of this paper is organized as follows. Section II presents the electricity-heat-gas IES, including the carbon trading cost calculation model for the RPLT-CTM-based IES, and the mathematical model for IES optimal dispatch. In Section III, the optimal dispatch problem is formulated as an MDP, and the DPPO-based method for IES optimal dispatch is described in detail. Simulation results and the corresponding analysis are presented in Section IV. Conclusions and future work are discussed in Section V.
The primary goal of the optimal dispatch of the IES is to improve the economic benefits of the system: on the premise of satisfying the energy demand, the output of each device at each time step is scheduled to achieve optimal economic operation. Furthermore, to realize low-carbon operation of the system, the RPLT-CTM is introduced to incorporate carbon trading costs into the operating costs of the system. To this end, we establish a comprehensive ED model that considers the RPLT-CTM. The structure of the electricity-heat-gas IES is shown in Fig. 1.

Fig. 1 Structure of electricity-heat-gas IES.
The IES consists of energy suppliers, renewable energy generation devices, load demand, coupling devices, and energy storage devices. The renewable energy generation devices include wind turbines (WTs) and photovoltaic (PV) generators. The load demand includes electrical, heat, and gas loads. The coupling equipment includes a combined heat and power (CHP) unit, a power-to-gas (PtG) unit, and a gas boiler (GB). The energy storage equipment includes battery energy storage (BES), gas storage tanks (GSTs), and thermal storage tanks (TSTs).
The CTM can guide energy companies to reduce emissions, and its essence is to treat carbon credit allowances as freely tradable commodities [
The allocation of initial carbon credits is a prerequisite for low-carbon power dispatch. The initial carbon emission allowance allocation is performed using the free allocation method.
In the IES model, the electricity purchased from the external grid is produced by coal-fired units. In addition to the equipment in the IES that generates carbon emissions, natural gas loads are also considered. The CHP unit is considered as heat supply equipment, and its carbon credits are allocated according to the equivalent total heat supply. Thus, the power generated by the CHP units needs to be converted into an equivalent heat supply. The model is expressed as:
(1)
where EIES,c is the total carbon credit allowance of the IES; Egrid,c, ECHP,c, and EGB,c are the carbon credit allowances for the coal-fired units, CHP, and GB, respectively; Egload,c is the carbon credit allowance received by the user for the consumption of natural gas; Δt is the interval of each time step; , , , and are the output power of the coal-fired units, CHP, and GB at time step t, respectively; is the flow rate of the natural gas load at time step t; , , and are the carbon credit allocation factors for the electricity supply equipment, heat supply equipment, and natural gas load, respectively; and is the conversion factor of power generation into heat supply, which is taken as 6 MJ/kWh.
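Because the expression for (1) is not reproduced above, the following is a plausible sketch of the allowance model under the free-allocation scheme just described. The symbols $P_{\mathrm{grid}}^{t}$, $P_{\mathrm{CHP,e}}^{t}$, $P_{\mathrm{CHP,h}}^{t}$, $P_{\mathrm{GB}}^{t}$, $G_{\mathrm{gload}}^{t}$, $\lambda_{\mathrm{e}}$, $\lambda_{\mathrm{h}}$, $\lambda_{\mathrm{gas}}$, and $\varphi$ are assumed names for the quantities defined in the preceding paragraph.

```latex
\begin{aligned}
E_{\mathrm{IES,c}} &= E_{\mathrm{grid,c}} + E_{\mathrm{CHP,c}} + E_{\mathrm{GB,c}} + E_{\mathrm{gload,c}} \\
E_{\mathrm{grid,c}} &= \lambda_{\mathrm{e}} \sum_{t=1}^{T} P_{\mathrm{grid}}^{t} \Delta t, \qquad
E_{\mathrm{CHP,c}} = \lambda_{\mathrm{h}} \sum_{t=1}^{T} \left( \varphi P_{\mathrm{CHP,e}}^{t} + P_{\mathrm{CHP,h}}^{t} \right) \Delta t \\
E_{\mathrm{GB,c}} &= \lambda_{\mathrm{h}} \sum_{t=1}^{T} P_{\mathrm{GB}}^{t} \Delta t, \qquad
E_{\mathrm{gload,c}} = \lambda_{\mathrm{gas}} \sum_{t=1}^{T} G_{\mathrm{gload}}^{t} \Delta t
\end{aligned}
```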
In the IES, the operation of the CHP units and GB generates carbon emission. The electricity purchased from the external grid comes from coal-fired units, the operation of which generates carbon emissions. The consumption of natural gas loads, mainly through combustion, also generates carbon emissions. The working process of the PtG unit involves the absorption of CO2. The carbon emission model of the IES is:
(2)
where EIES,e is the total carbon emission of the IES; Egrid,e, ECHP,e, EGB,e, and Egload,e are the carbon emissions generated by the coal-fired units, CHP, GB, and natural gas load, respectively; EPtG,e is the amount of CO2 absorbed in the energy conversion process of the PtG unit; , , and are the carbon emission factors for the electricity supply equipment, heat supply equipment, and natural gas load, respectively; is the electric power consumed by the PtG unit at time step t; and is the parameter for the absorption of CO2 in the energy conversion of the PtG unit.
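Analogously, a hedged sketch of the emission model (2), using the assumed symbols above together with $\beta_{\mathrm{e}}$, $\beta_{\mathrm{h}}$, $\beta_{\mathrm{gas}}$, $\beta_{\mathrm{PtG}}$, and the PtG consumption $P_{\mathrm{PtG}}^{t}$, is:

```latex
\begin{aligned}
E_{\mathrm{IES,e}} &= E_{\mathrm{grid,e}} + E_{\mathrm{CHP,e}} + E_{\mathrm{GB,e}} + E_{\mathrm{gload,e}} - E_{\mathrm{PtG,e}} \\
E_{\mathrm{grid,e}} &= \beta_{\mathrm{e}} \sum_{t=1}^{T} P_{\mathrm{grid}}^{t} \Delta t, \qquad
E_{\mathrm{CHP,e}} = \beta_{\mathrm{h}} \sum_{t=1}^{T} \left( \varphi P_{\mathrm{CHP,e}}^{t} + P_{\mathrm{CHP,h}}^{t} \right) \Delta t \\
E_{\mathrm{GB,e}} &= \beta_{\mathrm{h}} \sum_{t=1}^{T} P_{\mathrm{GB}}^{t} \Delta t, \qquad
E_{\mathrm{gload,e}} = \beta_{\mathrm{gas}} \sum_{t=1}^{T} G_{\mathrm{gload}}^{t} \Delta t, \qquad
E_{\mathrm{PtG,e}} = \beta_{\mathrm{PtG}} \sum_{t=1}^{T} P_{\mathrm{PtG}}^{t} \Delta t
\end{aligned}
```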
The RPLT-CTM [

Fig. 2 Relationship between carbon trading price and cumulative carbon trading volume.
The mathematical model of the reward and penalty ladder-type carbon trading is expressed as:
(3)
(4)
(5)
where is the amount of carbon trading at time step t; EIES is the cumulative carbon trading volume; is the carbon trading cost of the IES at time step t; c is the carbon trading price; is the penalty factor, which is taken as 0.2; is the reward factor, which is taken as 0.25; and is the length of the carbon trading range.
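To make the ladder mechanism of (3)-(5) concrete, the sketch below implements one possible reward-penalty ladder pricing rule consistent with Fig. 2: purchases above the free allowance become progressively more expensive (penalty factor 0.2 per ladder step), whereas sales below the allowance are rewarded at progressively higher prices (reward factor 0.25 per step). The function name, arguments, and interval handling are illustrative assumptions rather than the paper's exact formulation.

```python
def rplt_carbon_cost(e_net, c, d, mu=0.2, sigma=0.25):
    """Reward-penalty ladder-type carbon trading cost (illustrative sketch).

    e_net : net carbon trading volume (actual emission minus free allowance), t
    c     : base carbon trading price, $/t
    d     : length of each carbon trading interval (ladder step), t (d > 0)
    mu    : penalty factor (price growth per step when buying), taken as 0.2
    sigma : reward factor (price growth per step when selling), taken as 0.25
    Returns the carbon trading cost (negative values are revenue).
    """
    cost = 0.0
    remaining = abs(e_net)
    step = 0
    while remaining > 0:
        vol = min(remaining, d)            # volume traded within this ladder step
        if e_net > 0:                      # emission exceeds allowance: buy allowances
            cost += c * (1 + step * mu) * vol
        else:                              # allowance exceeds emission: sell allowances
            cost -= c * (1 + step * sigma) * vol
        remaining -= vol
        step += 1
    return cost
```

For instance, `rplt_carbon_cost(-0.5, c=40, d=1.0)` returns -20.0, i.e., a revenue of $20 for selling 0.5 t of surplus allowance within the first reward interval.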
The primary goal of the IES dynamic energy dispatch is to improve the economy and environmental friendliness of the system while meeting the constraints. The objective function is mainly composed of energy purchase and carbon trading costs. The objective function F of the optimal dispatch is defined as:
$F=\sum_{t=1}^{T}\left(C_{\mathrm{E}}^{t}+C_{\mathrm{CT}}^{t}\right)$ (6)
where is the energy purchase cost at time step t.
To satisfy the electricity-heat-gas load demand, the system purchases energy from energy suppliers as fuel for the operation of the coupling equipment. The equipment that consumes electrical energy includes the PtG unit and the electric boiler (EB), and the equipment that consumes natural gas comprises the CHP unit and the GB. This cost is expressed as:
$C_{\mathrm{E}}^{t}=C_{\mathrm{e}}^{t}+C_{\mathrm{gas}}^{t}$ (7)
$C_{\mathrm{e}}^{t}=c_{\mathrm{e}}^{t} P_{\mathrm{grid}}^{t} \Delta t$ (8)
$C_{\mathrm{gas}}^{t}=c_{\mathrm{gas}} G_{\mathrm{gas}}^{t} \Delta t$ (9)
where and are the costs of the purchased electricity and natural gas, respectively; is the output flow rate of the natural gas supplier; is the electricity price; and is the natural gas price.
The constraints of IES dynamic scheduling consist of energy balance, equipment operation, and energy supplier constraints.
To meet the electricity-heat-gas load demand at each time step, the energy balance constraints are:
(10)
(11)
(12)
where is the renewable energy generation; is the charging/discharging power of the BES; is the electric power consumed by the EB; is the power output of the EB; is the charging/discharging power of the TST; is the output flow rate of PtG; is the charging/discharging power of the GST; is the flow rate of natural gas consumed by CHP; is the flow rate of natural gas consumed by the GB; and and are the electrical load and heat load, respectively.
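Since (10)-(12) are not reproduced above, a sketch of the three balance constraints is given below, assuming the symbol names used in this rewrite and the convention that the storage powers ($P_{\mathrm{BES}}^{t}$, $P_{\mathrm{TST}}^{t}$, $G_{\mathrm{GST}}^{t}$) are positive when discharging and negative when charging.

```latex
\begin{aligned}
P_{\mathrm{grid}}^{t} + P_{\mathrm{re}}^{t} + P_{\mathrm{CHP,e}}^{t} + P_{\mathrm{BES}}^{t}
  &= P_{\mathrm{eload}}^{t} + P_{\mathrm{PtG}}^{t} + P_{\mathrm{EB,e}}^{t} \\
P_{\mathrm{CHP,h}}^{t} + P_{\mathrm{EB,h}}^{t} + P_{\mathrm{GB}}^{t} + P_{\mathrm{TST}}^{t}
  &= P_{\mathrm{hload}}^{t} \\
G_{\mathrm{gas}}^{t} + G_{\mathrm{PtG}}^{t} + G_{\mathrm{GST}}^{t}
  &= G_{\mathrm{gload}}^{t} + G_{\mathrm{CHP}}^{t} + G_{\mathrm{GB}}^{t}
\end{aligned}
```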
① Energy supply devices
a) CHP
The CHP unit provides heat and electricity to the system and acts as an energy provider in the electricity and heating networks. The mathematical model of the CHP unit is expressed as:
(13)
(14)
where kCHP is the thermoelectric ratio of CHP; is the efficiency of CHP; and HGV is the high calorific value of natural gas, which is taken as 39 MJ/m³.
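A plausible form of (13) and (14), with $G_{\mathrm{CHP}}^{t}$ denoting the gas flow rate consumed by the CHP unit and $\eta_{\mathrm{CHP}}$ its efficiency (symbols assumed), is:

```latex
\begin{aligned}
P_{\mathrm{CHP,h}}^{t} &= k_{\mathrm{CHP}} P_{\mathrm{CHP,e}}^{t} \\
P_{\mathrm{CHP,e}}^{t} + P_{\mathrm{CHP,h}}^{t} &= \eta_{\mathrm{CHP}} G_{\mathrm{CHP}}^{t} H_{\mathrm{GV}}
\end{aligned}
```

That is, the heat output tracks the electric output through the thermoelectric ratio, and the total CHP output equals the fuel energy input (gas flow rate times calorific value) scaled by the conversion efficiency.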
The power output and ramping rate constraints of the CHP unit are given by (15)-(18).
(15)
(16)
(17)
(18)
where and are the lower and upper bounds of the output electric power, respectively; and are the lower and upper bounds of the output heat power of CHP, respectively; and are the output electric and heat power of CHP at time step , respectively; and and are the ramping rates of CHP.
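The bound and ramping constraints (15)-(18) are not reproduced above; a generic sketch, assuming symmetric ramp limits $R_{\mathrm{CHP,e}}$ and $R_{\mathrm{CHP,h}}$, is given below. The constraints for the PtG, EB, and GB in (20)-(27) take the same form with the corresponding device variables and bounds.

```latex
\begin{aligned}
P_{\mathrm{CHP,e}}^{\min} \le P_{\mathrm{CHP,e}}^{t} \le P_{\mathrm{CHP,e}}^{\max}, \qquad
P_{\mathrm{CHP,h}}^{\min} \le P_{\mathrm{CHP,h}}^{t} \le P_{\mathrm{CHP,h}}^{\max} \\
\left| P_{\mathrm{CHP,e}}^{t+1} - P_{\mathrm{CHP,e}}^{t} \right| \le R_{\mathrm{CHP,e}}, \qquad
\left| P_{\mathrm{CHP,h}}^{t+1} - P_{\mathrm{CHP,h}}^{t} \right| \le R_{\mathrm{CHP,h}}
\end{aligned}
```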
b) PtG
The PtG unit converts electric power into gas. The relationship between the electric power consumption and the natural gas supply is expressed as:
(19)
where is the efficiency of PtG.
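A sketch of (19), assuming the PtG output is expressed as a gas flow rate so that the converted electric energy is divided by the calorific value $H_{\mathrm{GV}}$ (if the output is expressed directly as gas power, the relation reduces to $G_{\mathrm{PtG}}^{t}=\eta_{\mathrm{PtG}} P_{\mathrm{PtG}}^{t}$), is:

```latex
G_{\mathrm{PtG}}^{t} = \frac{\eta_{\mathrm{PtG}} P_{\mathrm{PtG}}^{t}}{H_{\mathrm{GV}}}
```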
The power and ramping rate constraints of the PtG unit are shown in (20) and (21), respectively.
(20)
(21)
where and are the lower and upper bounds of the consumed electric power, respectively; is the electric power consumed by PtG at time step ; and and are the ramping rates of PtG.
c) EB
The EB converts electric power into heat to satisfy the heat load. The relationship between the electric power consumption and the heat supply is expressed as:
$P_{\mathrm{EB,h}}^{t}=\eta_{\mathrm{EB}} P_{\mathrm{EB,e}}^{t}$ (22)
where is the efficiency of the EB.
The power output and ramping rate constraints of the EB are shown in (23) and (24), respectively.
(23)
(24)
where and are the lower and upper bounds of the output heat power of the EB, respectively; is the power output of the EB at time step ; and and are the ramping rates of the EB.
d) GB
The GB converts natural gas power into heat power, which is used to supplement the remaining heat load demand when the CHP heat supply is insufficient. The relationship between the natural gas power consumption and the heat supply is expressed as:
(25)
where is the efficiency of the GB.
The power output and ramping rate constraints of the GB are given by (26) and (27), respectively.
(26)
(27)
where and are the lower and upper bounds of the output heat power of the GB, respectively; is the power output of the GB at time step ; and and are the ramping rates of the GB.
② Energy storage equipment
a) BES
The BES stores excess energy in the system and discharges it to meet the electrical demand when the power supply is insufficient. For the BES, the state of charge (SOC) is a key operational parameter that directly reflects the remaining energy of the device.
(28)
(29)
(30)
where and are the SOCs of the BES at time steps t and , respectively; SOCmin and SOCmax are the lower and upper bounds of the SOC of the BES, respectively; QBES is the capacity of the BES; is the charging/discharging efficiency of the BES; and and are the charging and discharging coefficients, respectively.
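A common formulation of (28)-(30) is sketched below, assuming $\lambda_{\mathrm{c}}$ and $\lambda_{\mathrm{d}}$ are binary charging/discharging coefficients that prevent simultaneous charging and discharging, and $P_{\mathrm{BES,c}}^{t}$ and $P_{\mathrm{BES,d}}^{t}$ are the charging and discharging powers. The TST model (31)-(33) and the GST model (34)-(36) take the same form with HSD and GSD in place of SOC.

```latex
\begin{aligned}
SOC^{t+1} &= SOC^{t} + \frac{\left( \lambda_{\mathrm{c}} \eta_{\mathrm{BES}} P_{\mathrm{BES,c}}^{t}
             - \lambda_{\mathrm{d}} P_{\mathrm{BES,d}}^{t} / \eta_{\mathrm{BES}} \right) \Delta t}{Q_{\mathrm{BES}}} \\
SOC^{\min} &\le SOC^{t} \le SOC^{\max} \\
\lambda_{\mathrm{c}} + \lambda_{\mathrm{d}} &\le 1, \qquad \lambda_{\mathrm{c}}, \lambda_{\mathrm{d}} \in \{0,1\}
\end{aligned}
```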
b) TST
Similar to the BES, the TST can store excess heat and supply the heat needed for a heat load in the event of a heating shortage. Similar to the definition of SOC, the heat storage degree (HSD) is defined to monitor the amount of heat energy that can be stored in the equipment.
(31)
(32)
(33)
where and are the HSDs of the TST at time steps t and , respectively; HSDmin and HSDmax are the lower and upper bounds of the HSD of the TST, respectively; QTST is the capacity of the TST; and is the charging/discharging efficiency of the TST.
c) GST
The gas storage degree (GSD) of the GST is defined to monitor the amount of natural gas energy that can be stored in the equipment.
(34)
(35)
(36)
where and are the GSDs of the GST at time steps t and , respectively; GSDmin and GSDmax are the lower and upper bounds of the GSD of the GST, respectively; QGST is the capacity of the GST; and is the charging/discharging efficiency of the GST.
③ Energy supplier constraints
In the dispatch model established in this paper, electricity and natural gas must be purchased from external suppliers to fuel the coupling equipment and meet the load demand. The energy suppliers satisfy the following constraints.
$P_{\mathrm{grid}}^{\min} \leq P_{\mathrm{grid}}^{t} \leq P_{\mathrm{grid}}^{\max}$ (37)
$G_{\mathrm{gas}}^{\min} \leq G_{\mathrm{gas}}^{t} \leq G_{\mathrm{gas}}^{\max}$ (38)
where and are the lower and upper bounds of the output electric power of the coal-fired units, respectively; and and are the lower and upper bounds of the output gas flow rate of the supplier, respectively.
In this section, the IES optimal dispatch is formulated as an MDP, and the specific reinforcement learning algorithm is explained.
MDP is a mathematically idealized form of the RL problem and a theoretical framework for achieving goals through interactive learning. An MDP consists of a state space S, an action space A, a state transition probability function P, a reward function R, and a discount coefficient γ.
An RL framework is built to solve the low-carbon ED problem for an IES, as shown in Fig. 3.

Fig. 3 RL framework for IES optimal dispatch.
The state space S contains the information that describes the state of the IES, and the dispatch agent decides the dispatch strategy based on the observed state at each time step. Specifically, the state space S includes the electrical load , heat load , natural gas load , power output of renewable energy , SOC of the BES , status (HSD) of the TST , and status (GSD) of the GST . Consequently, the state space is defined as:
$S=\left\{P_{\mathrm{eload}}^{t}, P_{\mathrm{hload}}^{t}, G_{\mathrm{gload}}^{t}, P_{\mathrm{re}}^{t}, SOC^{t}, HSD^{t}, GSD^{t}\right\}$ (39)
The dispatch agent realizes the optimal scheduling strategy for the IES by controlling the electric and heat power outputs of CHP (, ), heat power output of the EB , heat power output of the GB , the gas power output of PtG , electric power purchased from the main grid , natural gas power purchased from the natural gas supplier , electric power output of the BES , heat power output of the TST , and natural gas power output of the GST . The electric and natural gas power consumed by each device in the system such as is calculated from its output power. The energies purchased from external energy suppliers, and , are calculated using electric power balance constraints and gas power balance constraints, respectively. The heat power output of the GB can also be calculated using the heat power balance constraint. That is, when , , , , , and are jointly determined, the other variables can be obtained immediately. Therefore, action space is expressed as:
$A=\left\{P_{\mathrm{CHP,e}}^{t}, P_{\mathrm{EB,h}}^{t}, G_{\mathrm{PtG}}^{t}, P_{\mathrm{BES}}^{t}, P_{\mathrm{TST}}^{t}, G_{\mathrm{GST}}^{t}\right\}$ (40)
The reward function calculates the reward value $r_t$ based on the current state and action and returns it to the agent. The purpose of the reward is to guide the agent toward the stated goal, i.e., the low-carbon ED of the IES. Therefore, the reward function includes the operating cost CE and the carbon trading cost CCT of the system. Because the goal of the agent in reinforcement learning is to maximize the cumulative reward, these costs enter the reward with a negative sign. To accelerate convergence, a baseline b is added to the reward function so that both positive and negative reward values can be given. The reward function is defined as:
$r_t=-\left(C_{\mathrm{E}}^{t}+C_{\mathrm{CT}}^{t}\right)+b$ (41)
where b is taken as 30.
The stochastic nature of renewable energy generation and multiple energy loads needs to be considered in the IES optimal dispatch problem. To enable the agent to handle this uncertainty, the RL environment for the optimal scheduling problem needs to be established with stochasticity. Before the start of training for each episode, the environment randomly samples the load data that satisfy the upper and lower bound limits.
In each episode, a group of states is generated within the upper and lower limits. The energy loads and the renewable energy generation are generated randomly within the predefined range, which means that the dispatch strategy given by the agent can handle not only the uncertainty of loads but also the uncertainty of renewable energy generation.
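As an illustration of this randomized episode initialization, the sketch below draws load and renewable generation profiles uniformly within predefined bounds at each reset; the function name, bound values, and dictionary keys are hypothetical placeholders rather than the paper's data.

```python
import numpy as np

rng = np.random.default_rng()

def reset_episode(T=96, bounds=None):
    """Sample one random scenario (T time steps) within predefined limits."""
    bounds = bounds or {
        "e_load": (0.4, 1.2),    # electrical load, MW (hypothetical range)
        "h_load": (0.2, 0.9),    # heat load, MW (hypothetical range)
        "g_load": (10.0, 60.0),  # gas load, m³/h (hypothetical range)
        "re_gen": (0.0, 0.8),    # wind + PV output, MW (hypothetical range)
    }
    # One random profile per uncertain quantity, sampled step by step
    return {k: rng.uniform(lo, hi, size=T) for k, (lo, hi) in bounds.items()}

profiles = reset_episode()  # drawn anew before every training episode
```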
The DRL algorithm is introduced to solve the optimal dispatch problem for a continuous action space. PPO [
The PPO algorithm is a policy-based DRL algorithm with an actor-critic architecture. The advantage function is introduced to evaluate the goodness of action $a_t$ in state $s_t$.
$A_{\pi_{\theta}}\left(s_t, a_t\right)=Q_{\pi_{\theta}}\left(s_t, a_t\right)-V_{\pi_{\theta}}\left(s_t\right)$ (42)
The action-value (Q-value) function is used to evaluate the performance of policy $\pi_{\theta}$, and is defined as:
$Q_{\pi_{\theta}}\left(s_t, a_t\right)=\mathbb{E}_{\pi_{\theta}}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t, a_t\right]$ (43)
where $\pi_{\theta}$ is the policy with parameter θ; and $\gamma$ is the reward discount factor.
The state-value function is used to evaluate the quality of state $s_t$, and is expressed as:
$V_{\pi_{\theta}}\left(s_t\right)=\mathbb{E}_{\pi_{\theta}}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t\right]$ (44)
From (43) and (44), the value of the action-value function represents the expectation of the cumulative reward for choosing action $a_t$ in state $s_t$ under the guidance of policy network $\pi_{\theta}$. Furthermore, the value of the state-value function represents the expectation of the cumulative reward over all actions in state $s_t$ under policy $\pi_{\theta}$.
With the introduction of the advantage function $A_{\pi_{\theta}}\left(s_t, a_t\right)$, the original objective function can be rewritten as:
$J(\theta)=\mathbb{E}_{t}\left[\frac{\pi_{\theta}\left(a_t \mid s_t\right)}{\pi_{\theta_{\mathrm{old}}}\left(a_t \mid s_t\right)} A_{\pi_{\theta_{\mathrm{old}}}}\left(s_t, a_t\right)\right]$ (45)
where $\theta$ is the parameter of the policy network to be optimized; and $\theta_{\mathrm{old}}$ is the parameter of the policy network that interacts with the environment to sample data. This is the surrogate objective function.
Next, the clipped surrogate objective method is employed. The surrogate objective function is written as:
$J^{\mathrm{CLIP}}(\theta)=\mathbb{E}_{t}\left[\min \left(\rho_{t}(\theta) A_{\pi_{\theta_{\mathrm{old}}}}\left(s_t, a_t\right), \operatorname{clip}\left(\rho_{t}(\theta), 1-\varepsilon, 1+\varepsilon\right) A_{\pi_{\theta_{\mathrm{old}}}}\left(s_t, a_t\right)\right)\right]$ (46)
$\rho_{t}(\theta)=\frac{\pi_{\theta}\left(a_t \mid s_t\right)}{\pi_{\theta_{\mathrm{old}}}\left(a_t \mid s_t\right)}$ (47)
$\operatorname{clip}\left(\rho_{t}(\theta), 1-\varepsilon, 1+\varepsilon\right)=\begin{cases}1-\varepsilon & \rho_{t}(\theta)<1-\varepsilon \\ \rho_{t}(\theta) & 1-\varepsilon \leq \rho_{t}(\theta) \leq 1+\varepsilon \\ 1+\varepsilon & \rho_{t}(\theta)>1+\varepsilon\end{cases}$ (48)
where $\varepsilon$ is the surrogate objective function clipping rate applied to limit the change in policy.
The clip function limits the probability ratio to a certain range and takes the maximum or minimum value if it is out of range. By clipping the probability ratio, changes in policy are maintained within a reasonable range. This ensures that the change in policy is not too intense when the advantage is positive and that the update direction is correct when the advantage is negative. Finally, the PPO algorithm updates the policy network parameters using gradient ascent.
$\theta \leftarrow \theta+\alpha_{\mathrm{a}} \nabla_{\theta} J^{\mathrm{CLIP}}(\theta)$ (49)
where $\alpha_{\mathrm{a}}$ is the learning rate of the policy network.
The PPO algorithm adopts an actor-critic architecture. After updating the policy network, i.e., the actor network, the critic network is updated by minimizing a loss function based on temporal-difference (TD) learning.
$\delta_{t}=r_{t}+\gamma V_{\omega}\left(s_{t+1}\right)-V_{\omega}\left(s_{t}\right)$ (50)
$L(\omega)=\mathbb{E}_{t}\left[\delta_{t}^{2}\right]$ (51)
$\omega \leftarrow \omega-\alpha_{\mathrm{c}} \nabla_{\omega} L(\omega)$ (52)
where $L(\omega)$ is the loss function of the critic network with parameter $\omega$; and $\alpha_{\mathrm{c}}$ is the learning rate of the Q-value network, i.e., the critic network.
To obtain better performance in the established IES optimal scheduling environment, the agent must fully explore the environment and face different scenarios. Therefore, the PPO algorithm with a distributed setting is introduced to achieve better training performance. DPPO includes workers and a chief: the workers are multiple threads that interact with their respective environments to sample data and provide the data to the chief for learning. All parallel threads share the same policy network parameters from the global learner. The chief updates the network parameters and passes the updated parameters to the workers. The workers do not compute gradients or push gradients of their own policy updates to the chief; this improves the efficiency of multithreaded data collection and reduces the difficulty of implementing the algorithm. The framework of the DPPO algorithm training process is illustrated in Fig. 4.

Fig. 4 Framework of DPPO algorithm training process.
The distributed setting of DPPO is reflected in the parallel collection of data by the multithreaded worker networks for the chief network update. In simple terms, DPPO can be understood as a multithreaded parallel PPO. The training process of DPPO is realized through multithreading and communication among multiple threads. The exploration threads of the workers and the update thread of the chief are not executed simultaneously and communicate through events. The flow of the alternating execution of multiple threads in DPPO is shown in Fig. 5.

Fig. 5 Flow of alternating execution of multiple threads in DPPO.
At the beginning of training, the exploration event is set and the workers start interacting with the environment to collect data, while the update event is cleared and the update thread enters the waiting state. In the exploration threads, the global variable global_update_counter records the number of steps taken by the workers when interacting with the environment. When the value of global_update_counter exceeds the mini-batch size, the update event is set and the chief network starts to update, while the exploration event is cleared so that the workers enter the waiting state when they reach "wait". After the chief network update is complete, the update event is cleared and the update thread is suspended, while the exploration event is set and the workers continue to interact with the environment to collect data. The offline training process of the DPPO algorithm is shown in Algorithm 1.
Algorithm 1: offline training process of DPPO

Initialize the parameters of the actor and critic networks randomly
Initialize the old actor parameters with the actor parameters
exploration_event.set(), update_event.clear()
global_update_counter ← 0
for episode = 1 to N do
  if not exploration_event.is_set() then exploration_event.wait() end if
  // exploration thread (executed by the parallel workers)
  for u = 1 to U do
    Reset the initial state of the IES dispatch environment
    Generate a random scenario
    for dispatch time step t = 1 to T do
      Observe state s_t
      Select energy dispatch action a_t by the old actor
      Execute action a_t
      Calculate the state of the equipment by (13)-(38)
      Calculate the reward r_t by (41)
      Obtain the next state s_{t+1}
      global_update_counter ← global_update_counter + 1
      if global_update_counter ≥ mini-batch size then
        exploration_event.clear(), update_event.set()
      end if
    end for
  end for
  Get the trajectory and push the data to the chief
  if not update_event.is_set() then update_event.wait() end if
  // update thread (chief)
  for m = 1 to M do
    Calculate the loss function by (50) and (51) and update the parameters of the critic network by (52)
    Calculate the surrogate objective function by (46)
    Update the parameters of the new actor by (49)
    Update the parameters of the old actor with the new actor parameters
  end for
  global_update_counter ← 0
  update_event.clear(), exploration_event.set()
end for
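To make the event-driven alternation between exploration and update threads concrete, the sketch below shows a minimal worker/chief skeleton using Python's threading events; the names, batch size, and the callables `env_step` and `update_networks` are illustrative placeholders, not the paper's implementation.

```python
import threading

BATCH = 64
explore_evt, update_evt = threading.Event(), threading.Event()
explore_evt.set()                      # workers may explore at the start
buffer, buf_lock = [], threading.Lock()
stop = threading.Event()

def worker(env_step):
    while not stop.is_set():
        explore_evt.wait(timeout=0.1)  # pause while the chief is updating
        if not explore_evt.is_set():
            continue
        sample = env_step()            # interact with the IES environment
        with buf_lock:
            buffer.append(sample)
            if len(buffer) >= BATCH:   # enough data: hand over to the chief
                explore_evt.clear()
                update_evt.set()

def chief(update_networks):
    while not stop.is_set():
        if not update_evt.wait(timeout=0.1):
            continue
        with buf_lock:
            batch, buffer[:] = list(buffer), []
        update_networks(batch)         # PPO update on the collected mini-batch
        update_evt.clear()
        explore_evt.set()              # let the workers resume exploration

# Example wiring with dummy callables (4 workers, 1 chief):
# ws = [threading.Thread(target=worker, args=(lambda: 0,), daemon=True) for _ in range(4)]
# ch = threading.Thread(target=chief, args=(lambda b: None,), daemon=True)
```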
In this section, a platform for IES optimal scheduling is established and experiments are conducted using this IES platform to verify the superiority of the proposed DPPO-based dispatch method. The parameter settings, experimental details, and concluding analysis are presented in the following subsections.
To demonstrate the performance of the proposed DPPO-based dispatch method, the IES shown in Fig. 1 is used as the test system.
The purchasing electricity price is the time-of-use (TOU) price. The peak-time price is 12.3 ¢/kWh (12:00-20:00), the valley-time price is 4.2 ¢/kWh (00:00-08:00), and the flat-time price is 7.8 ¢/kWh at all other times. The natural gas price is fixed at 49 ¢/m³.
Parameter | Value | Parameter | Value |
---|---|---|---|
βe (t/MWh) | 1.08 | λe (t/MWh) | 0.798 |
βh (t/MWh) | 0.234 | λh (t/MWh) | 0.385 |
βgas (t/m³) | 2.166×10⁻³ | λgas (t/m³) | 1.95×10⁻³
βPtG (t/MWh) | 0.106 |
The parameters of the equipment operating constraints are provided in the following tables.
Equipment | Minimum power (MW) | Maximum power (MW) | Ramping power (MW)
---|---|---|---|
CHP | 0.2 | 1.2 | 0.1250 |
PtG | 0.0 | 0.5 | 0.0625 |
EB | 0.0 | 0.6 | 0.0750 |
GB | 0.0 | 0.6 | 0.0750 |
Equipment | Capacity (MWh) | Charging efficiency | Discharging efficiency |
---|---|---|---|
BES | 0.30 | 0.92 | 0.85 |
TST | 0.30 | 0.95 | 0.95 |
GST | 0.54 | 0.98 | 0.98 |
The proposed method and compared algorithms were implemented using TensorFlow and MATLAB. Simulation experiments were performed on a server with an Intel Xeon Gold 6230R CPU and an NVIDIA Quadro RTX 5000 GPU.
The core hyperparameter settings used for training the DPPO algorithm are listed in the following table.
Hyperparameter | Value |
---|---|
Learning rate for actor network | 0.0001 |
Learning rate for critic network | 0.0002 |
Discount factor | 0.97 |
Maximum number of episodes | 10000
Steps per episode | 96
Mini-batch size | 64 |
Surrogate objective function clipping rate | 0.2 |
Number of parallel workers | 4 |
The DRL environment used to train the agent to learn a low-carbon economy dispatch policy was implemented based on Python 3.6, the framework of which is described in detail in Section III.
To verify the effectiveness of the established environment, an agent is trained in it using the DPPO algorithm. After testing different combinations of hyperparameters, the training results for the original version of the DRL environment are found to be poor. Therefore, to achieve better training results, state normalization (whitening) and reward normalization (whitening) are introduced. The cumulative rewards obtained from training in environments in which different tricks are applied are shown in Fig. 6.

Fig. 6 Comparison of cumulative rewards in DRL environments with tricks.
In Fig. 6, the cumulative rewards obtained in the DRL environments with different combinations of these tricks are compared.
By comparing and analyzing the training results of different environments, we notice that in the environment established in this study, the actor network and critic network are more suitable for the input-normalized states. In addition, the normalization of the reward helps the DRL agent to learn the dispatch strategy more effectively.
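For concreteness, the sketch below shows a generic running-statistics whitening filter of the kind commonly used for such state and reward normalization; the class name and the exact update scheme are assumptions, not the paper's implementation.

```python
import numpy as np

class RunningWhitener:
    """Online z-score normalization for states or rewards (generic sketch)."""
    def __init__(self, eps=1e-8):
        self.n, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        # Welford's online update of the running mean and squared deviations
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.n
        self.m2 = self.m2 + delta * (x - self.mean)

    def __call__(self, x):
        self.update(x)
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + self.eps
        return (np.asarray(x, dtype=float) - self.mean) / std

state_filter = RunningWhitener()   # applied to each observed state
reward_filter = RunningWhitener()  # applied to each reward before training
```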
To analyze the benefits of introducing the RPLT-CTM for the low-carbon economic operation of IES, two scenarios are set up for comparative analysis, which are described as follows.
1) Scenario 1: the CTM is a carbon tax model in which the price of buying or selling carbon rights is fixed and does not change with the volume of carbon rights traded.
2) Scenario 2: the CTM is the RPLT-CTM model, the details of which are described in Section II.
To demonstrate the effectiveness of the proposed method, the actual operational data of an IES [

Fig. 7 Load demand and renewable energy generation on test day.
To intuitively compare the characteristics of the two carbon trading models, the agents trained with the DPPO algorithm in the two scenarios provide their scheduling plans for the test day; the resulting operating costs and carbon emissions are shown in Fig. 8 and in the following table.

Fig. 8 Operating costs and carbon emission based on two scenarios.
Scenario | Carbon credit (t) | Carbon emission (t) | Carbon trading cost ($) | Operating cost ($) |
---|---|---|---|---|
Scenario 1 | 15.89 | 12.16 | -179.22 | 1872.00 |
Scenario 2 | 15.54 | 11.38 | -224.99 | 1789.24 |
Evidently, the RPLT-CTM-based agent (Scenario 2) achieves lower carbon emissions (11.38 t versus 12.16 t) and a lower operating cost ($1789.24 versus $1872.00) than the traditional CTM-based agent (Scenario 1), while obtaining a larger carbon trading revenue.
The dispatch results of the IES based on DPPO for the test day in Scenario 2 are shown in Fig. 9.

Fig. 9 Dispatch results of IES based on DPPO for test day in Scenario 2. (a) Electrical network. (b) Heating network. (c) Natural gas network.
In Fig. 9, the dispatch results of the electrical, heating, and natural gas networks are shown for each time step of the test day.
Guided by the RPLT-CTM, the agent selects a dispatch plan with low carbon emissions and high economic efficiency. The detailed analysis of the scheduling results shows that the DPPO-trained dispatch agent provides real-time dispatch results according to the load demand and can achieve low-carbon and economic operation of the system by ensuring the safe and stable operation of the IES.
To verify the performance of the DPPO algorithm, it is compared with other DRL algorithms and traditional algorithms in this subsection.
Since DPPO is a distributed version of PPO, PPO is chosen for comparison. The benchmark DRL algorithms deep deterministic policy gradient (DDPG) and twin-delayed DDPG (TD3) are selected. Soft actor-critic (SAC), another popular DRL algorithm, is also used for comparison. Considering that DPPO is a distributed DRL algorithm, asynchronous advantage actor-critic (A3C) and distributed distributional deterministic policy gradients (D4PG) are also introduced. In addition, the double deep Q-network (DDQN), an improved extension of the deep Q-network (DQN) algorithm, is employed as another benchmark DRL algorithm.
The cumulative rewards of DPPO and other DRL algorithms in the training process are shown in Fig. 10.

Fig. 10 Cumulative rewards of DPPO and other DRL algorithms in training process.
In addition, particle swarm optimization (PSO)-, genetic algorithm (GA)-, and SO-based scheduling algorithms are introduced to compare IES operating costs and carbon emissions. The operating costs and carbon emissions of the scheduling plans for the test day provided by these algorithms are listed in the following table.
Algorithm | Carbon credit (t) | Carbon emission (t) | Carbon trading cost ($) | Operating cost ($) |
---|---|---|---|---|
DPPO | 15.54 | 11.38 | 224.99 | 1789.24 |
D4PG | 15.94 | 11.99 | 220.42 | 1817.08 |
TD3 | 15.99 | 12.17 | 214.79 | 1820.42 |
PPO | 15.95 | 11.76 | 219.44 | 1828.25 |
DDPG | 16.34 | 12.22 | 217.19 | 1841.59 |
A3C | 16.95 | 13.38 | 188.78 | 1859.15 |
SAC | 17.19 | 14.42 | 152.67 | 2010.06 |
DDQN | 18.07 | 15.84 | 133.29 | 2042.46 |
GA | 17.90 | 14.41 | 191.01 | 1880.23 |
PSO | 17.74 | 13.52 | 222.37 | 1889.07 |
SO | 16.79 | 12.48 | 224.83 | 1860.24 |
The results show that DRL-based dispatch algorithms with a continuous action space outperform the PSO- and SO-based algorithms. This is a consequence of the fact that DRL-based dispatch algorithms do not rely on day-ahead forecast information or an assumed distribution of uncertainty. In contrast, the DRL-based algorithm (DDQN) with a discrete action space is limited to a finite number of actions available in the action space. Therefore, its scheduling results are the worst among all algorithms.
The above analysis suggests that the DPPO-based method has higher learning efficiency and a better dispatch strategy than the other DRL-based algorithms. A comparison with other dispatch algorithms shows that the DPPO-based method also provides a better dispatch strategy.
In this paper, considering the uncertainty of load demand and renewable energy, a low-carbon ED method for electricity-heat-gas IES based on DRL is proposed. A reward function based on the RPLT-CTM is introduced to guide the DRL agent to learn low-carbon dispatch actions. A DRL agent trained by DPPO realizes the real-time low-carbon ED of an IES. The following conclusions are drawn.
1) Benefiting from the ladder-type dynamic trading price, the RPLT-CTM effectively guides the DRL agent to learn a low-carbon ED strategy. The dispatch results verify that the agent based on the RPLT-CTM makes a dispatch plan with lower carbon emissions compared with the agent based on the traditional CTM.
2) The effectiveness of the proposed DRL-based method for low-carbon ED of an electricity-heat-gas IES is demonstrated by the dispatch results on the test day. The agent trained using the proposed method controls the dispatch actions of each device in the IES in real time. The dispatch plan generated by the agent achieves the low-carbon economic operation of the electricity-heat-gas IES.
3) The superiority of DPPO is verified through a comparative analysis. The distributed architecture of DPPO enables it to perform better than PPO in complex training environments. Compared with the scheduling results of PPO, DPPO reduces the operating cost and carbon emissions by 2.13% and 3.23%, respectively. Compared with other distributed DRL algorithms (D4PG and A3C), the operating cost and carbon emissions of the DPPO-based method are reduced by 1.53%, 3.76% and 5.09%, 14.95%, respectively. DPPO is also compared with other DRL algorithms (DDPG, A3C, SAC, and DDQN) and dispatch algorithms (GA, PSO, and SO). The operating costs of the DPPO-based dispatch method are reduced by 2.84%, 3.76%, 10.99%, 12.40%, 4.84%, 5.28%, and 3.82%, and the carbon emissions are reduced by 6.87%, 14.95%, 21.08%, 28.16%, 21.03%, 15.83%, and 8.81%, respectively.
In future work, considering the characteristics of multiple operators of IES, multi-agent reinforcement learning will be applied to the optimal operation of an IES.
References
P. Li, Z. Wang, J. Wang et al., “Two-stage optimal operation of integrated energy system considering multiple uncertainties and integrated demand response,” Energy, vol. 225, p. 120256, Jun. 2021.
Y. Li, M. Han, Z. Yang et al., “Coordinating flexible demand response and renewable uncertainties for scheduling of community integrated energy systems with an electric vehicle charging station: a bi-level approach,” IEEE Transactions on Sustainable Energy, vol. 12, no. 4, pp. 2321-2331, Oct. 2021.
L. Chen, Q. Xu, Y. Yang et al., “Community integrated energy system trading: a comprehensive review,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 6, pp. 1445-1458, Nov. 2022.
W. Wang, S. Huang, G. Zhang et al., “Optimal operation of an integrated electricity-heat energy system considering flexible resources dispatch for renewable integration,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 4, pp. 669-710, Jul. 2021.
W. Wang, S. Huang, G. Zhang et al., “Optimal operation of an integrated electricity-heat energy system considering flexible resources dispatch for renewable integration,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 4, pp. 699-710, Jul. 2021.
R. Rocchetta, L. Bellani, M. Compare et al., “A reinforcement learning framework for optimal operation and maintenance of power grids,” Applied Energy, vol. 241, pp. 291-301, May 2019.
A. T. D. Perera and P. Kamalaruban, “Applications of reinforcement learning in energy systems,” Renewable and Sustainable Energy Reviews, vol. 137, p. 110618, Mar. 2021.
T. Yang, L. Zhao, W. Li et al., “Reinforcement learning in sustainable energy and electric systems: a survey,” Annual Reviews in Control, vol. 49, pp. 145-163, Apr. 2020.
L. He, Z. Lu, J. Zhang et al., “Low-carbon economic dispatch for electricity and natural gas systems considering carbon capture systems and power-to-gas,” Applied Energy, vol. 224, pp. 357-370, Aug. 2018.
H. Vella, “Last chance for carbon trading? Leaders at the COP26 climate conference will consider how to create a framework for global cooperation on carbon markets, which could be a key breakthrough for climate change mitigation,” Engineering & Technology, vol. 16, no. 10, pp. 1-4, Nov. 2021.
The People’s Government of Hainan Province. (2023, Jan.). Hainan International Carbon Emission Trading Center achieved its first cross-border carbon trading. [Online]. Available: https://www.hainan.gov.cn/hainan/5309/202301/7a3d3c12136f43e986b95578dd90de08.shtml
Y. Li, Y. Zou, Y. Tan et al., “Optimal stochastic operation of integrated low-carbon electric power, natural gas, and heat delivery system,” IEEE Transactions on Sustainable Energy, vol. 9, no. 1, pp. 273-283, Jan. 2018.
S. Lu, W. Gu, S. Zhou et al., “Adaptive robust dispatch of integrated energy system considering uncertainties of electricity and outdoor temperature,” IEEE Transactions on Industrial Informatics, vol. 16, no. 7, pp. 4691-4702, Jul. 2020.
A. Mansour-Saatloo, Y. Pezhmani, M. A. Mirzaei et al., “Robust decentralized optimization of multi-microgrids integrated with power-to-X technologies,” Applied Energy, vol. 304, p. 117635, Dec. 2021.
N. Nasiri, S. Zeynali, S. N. Ravadanegh et al., “A hybrid robust-stochastic approach for strategic scheduling of a multi-energy system as a price-maker player in day-ahead wholesale market,” Energy, vol. 235, p. 121398, Nov. 2021.
M. A. Mirzaei, K. Zare, B. Mohammadi-Ivatloo et al., “Robust network-constrained energy management of a multiple energy distribution company in the presence of multi-energy conversion and storage technologies,” Sustainable Cities and Society, vol. 74, p. 103147, Nov. 2021.
Y. Zhang, F. Zheng, S. Shu et al., “Distributionally robust optimization scheduling of electricity and natural gas integrated energy system considering confidence bands for probability density functions,” International Journal of Electrical Power & Energy Systems, vol. 123, p. 106321, Dec. 2020.
X. Lu, Z. Liu, L. Ma et al., “A robust optimization approach for optimal load dispatch of community energy hub,” Applied Energy, vol. 259, p. 114195, Feb. 2020.
Z. Li, L. Wu, Y. Xu et al., “Multi-stage real-time operation of a multi-energy microgrid with electrical and thermal energy storage sets: a data-driven MPC-ADP approach,” IEEE Transactions on Smart Grid, vol. 13, no. 1, pp. 213-226, Jan. 2022.
X. Jin, Q. Wu, H. Jia et al., “Optimal integration of building heating loads in integrated heating/electricity community energy systems: a bi-level MPC approach,” IEEE Transactions on Sustainable Energy, vol. 12, no. 3, pp. 1741-1754, Jul. 2021.
N. Nasiri, S. Zeynali, S. N. Ravadanegh et al., “A tactical scheduling framework for wind farm-integrated multi-energy systems to take part in natural gas and wholesale electricity markets as a price setter,” IET Generation, Transmission & Distribution, vol. 16, no. 9, pp. 1849-1864, Feb. 2022.
A. Mansour-Saatloo, R. Ebadi, M. A. Mirzaei et al., “Multi-objective IGDT-based scheduling of low-carbon multi-energy microgrids integrated with hydrogen refueling stations and electric vehicle parking lots,” Sustainable Cities and Society, vol. 74, p. 103197, Nov. 2021.
Y. Ji, J. Wang, J. Xu et al., “Real-time energy management of a microgrid using deep reinforcement learning,” Energies, vol. 12, no. 12, p. 2291, Jun. 2019.
Y. Liu, D. Zhang, and H. B. Gooi, “Optimization strategy based on deep reinforcement learning for home energy management,” CSEE Journal of Power and Energy Systems, vol. 6, no. 3, pp. 572-582, Sept. 2020.
F. Meng, Y. Bai, and J. Jin, “An advanced real-time dispatching strategy for a distributed energy system based on the reinforcement learning algorithm,” Renewable Energy, vol. 178, pp. 13-24, Nov. 2021.
K. Zhou, K. Zhou, and S. Yang, “Reinforcement learning-based scheduling strategy for energy storage in microgrid,” Journal of Energy Storage, vol. 51, p. 104379, Jul. 2022.
E. Mocanu, D. C. Mocanu, P. H. Nguyen et al., “On-line building energy optimization using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 3698-3708, May 2018.
T. A. Nakabi and P. Toivanen, “Deep reinforcement learning for energy management in a microgrid with flexible demand,” Sustainable Energy, Grids and Networks, vol. 25, p. 100413, Mar. 2021.
L. Lei, Y. Tan, G. Dahlenburg et al., “Dynamic energy dispatch based on deep reinforcement learning in IoT-driven smart isolated microgrids,” IEEE Internet of Things Journal, vol. 8, no. 10, pp. 7938-7953, May 2021.
C. Guo, X. Wang, Y. Zheng et al., “Real-time optimal energy management of microgrid with uncertainties based on deep reinforcement learning,” Energy, vol. 238, p. 121873, Jan. 2022.
B. Zhang, W. Hu, D. Cao et al., “Deep reinforcement learning-based approach for optimizing energy conversion in integrated electrical and heating system with renewable energy,” Energy Conversion and Management, vol. 202, p. 112199, Dec. 2019.
S. Zhou, Z. Hu, W. Gu et al., “Combined heat and power system intelligent economic dispatch: a deep reinforcement learning approach,” International Journal of Electrical Power & Energy Systems, vol. 120, p. 106016, Sept. 2020.
T. Yang, L. Zhao, W. Li et al., “Dynamic energy dispatch strategy for integrated energy system based on improved deep reinforcement learning,” Energy, vol. 235, p. 121377, Nov. 2021.
Y. Ye, D. Qiu, X. Wu et al., “Model-free real-time autonomous control for a residential multi-energy system using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3068-3082, Jul. 2020.
L. Zhao, T. Yang, W. Li et al., “Deep reinforcement learning-based joint load scheduling for household multi-energy system,” Applied Energy, vol. 324, p. 119346, Oct. 2022.
B. Zhang, W. Hu, J. Li et al., “Dynamic energy conversion and management strategy for an integrated electricity and natural gas system with renewable energy: deep reinforcement learning approach,” Energy Conversion and Management, vol. 220, p. 113063, Sept. 2020.
J. Dong, H. Wang, J. Yang et al., “Optimal scheduling framework of electricity-gas-heat integrated energy system based on asynchronous advantage actor-critic algorithm,” IEEE Access, vol. 9, pp. 139685-139696, Sept. 2021.
Q. Sun, D. Wang, D. Ma et al., “Multi-objective energy management for we-energy in Energy Internet using reinforcement learning,” in Proceedings of 2017 IEEE Symposium Series on Computational Intelligence, Honolulu, USA, Dec. 2017, pp. 1-6.
X. Teng, H. Long, and L. Yang, “Integrated electricity-gas system optimal dispatch based on deep reinforcement learning,” in Proceedings of IEEE Sustainable Power and Energy Conference, Nanjing, China, Dec. 2021, pp. 1082-1086.
B. Zhang, W. Hu, D. Cao et al., “Soft actor-critic-based multi-objective optimized energy conversion and management strategy for integrated energy systems with renewable energy,” Energy Conversion and Management, vol. 243, p. 114381, Sept. 2021.
G. Zhang, W. Hu, D. Cao et al., “A multi-agent deep reinforcement learning approach enabled distributed energy management schedule for the coordinate control of multi-energy hub with gas, electricity, and freshwater,” Energy Conversion and Management, vol. 255, p. 115340, Mar. 2022.
T. Chen, S. Bu, X. Liu et al., “Peer-to-peer energy trading and energy conversion in interconnected multi-energy microgrids using multi-agent deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 13, no. 1, pp. 715-727, Jan. 2022.
D. Qiu, Z. Dong, X. Zhang et al., “Safe reinforcement learning for real-time automatic control in a smart energy-hub,” Applied Energy, vol. 309, p. 118403, Mar. 2022.
Q. Sun, X. Wang, Z. Liu et al., “Multi-agent energy management optimization for integrated energy systems under the energy and carbon co-trading market,” Applied Energy, vol. 324, p. 119646, Oct. 2022.
D. Qiu, J. Xue, T. Zhang et al., “Federated reinforcement learning for smart building joint peer-to-peer energy and carbon allowance trading,” Applied Energy, vol. 333, p. 120526, Mar. 2023.
R. Wang, X. Wen, X. Wang et al., “Low carbon optimal operation of integrated energy system based on carbon capture technology, LCA carbon emissions and ladder-type carbon trading,” Applied Energy, vol. 311, p. 118664, Apr. 2022.
X. Zhang, X. Liu, J. Zhong et al., “Electricity-gas-integrated energy planning based on reward and penalty ladder-type carbon trading cost,” IET Generation, Transmission & Distribution, vol. 13, no. 23, pp. 5263-5270, Dec. 2019.
J. Schulman, F. Wolski, P. Dhariwal et al. (2017, Jul.). Proximal policy optimization algorithms. [Online]. Available: https://arxiv.org/abs/1707.06347
N. Heess, T. B. Dhruva, S. Sriram et al. (2017, Jul.). Emergence of locomotion behaviours in rich environments. [Online]. Available: https://arxiv.org/abs/1707.02286