Abstract
Building integrated energy systems (BIESs) account for a significant proportion of global energy consumption, making them pivotal for enhancing energy efficiency. Two key barriers to efficient BIES operation are the uncertainty of renewable generation and the operational non-convexity of combined heat and power (CHP) units. To this end, this paper proposes a soft actor-critic (SAC) algorithm to solve the scheduling problem of BIES, which overcomes the model non-convexity and shows advantages in robustness and generalization. This paper also adopts a temporal fusion transformer (TFT) to enhance the solution quality of the SAC algorithm by forecasting the renewable generation and energy demand. The TFT can effectively capture complex temporal patterns and dependencies that span multiple steps. Furthermore, its forecasting results are interpretable owing to the self-attention layer, which supports more trustworthy decision-making in the SAC algorithm. The proposed hybrid data-driven approach integrating the TFT and SAC algorithm, i.e., the TFT-SAC approach, is trained and tested on a real-world dataset to validate its superior performance in reducing the energy cost and computational time compared with benchmark approaches. The generalization performance of the scheduling policy, as well as a sensitivity analysis, is examined in the case studies.
THE rapid development in industry and urban areas has led to significant changes in energy systems, resulting in high renewable penetration and challenges for sustainable development. With buildings accounting for about 40% of global energy consumption, it is crucial to enhance the efficiency of building energy systems for meeting rising energy demands and supporting sustainability [
However, the optimal operation of BIES is hindered by two key challenges: ① the high operational risk due to the intermittent and uncertain nature of photovoltaic (PV) generation and energy demand [
BIESs have been extensively studied, particularly in the areas of scheduling [
While these conventional approaches are effective in managing the scheduling of multi-carrier energy systems, they face challenges in handling highly nonlinear units, particularly in competitive markets. Stochastic programming (SP) becomes inefficient as the number of scenarios increases, and RO often yields overly conservative results by focusing on the worst-case scenarios. Besides, both SP and RO suffer from the curse of dimensionality, where the increased actions, decision variables, and constraints lead to exponentially growing computational requirements, limiting their scalability for real-world applications involving multiple devices and uncertainties [
Reinforcement learning (RL) presents an innovative alternative that effectively addresses the above limitations by providing a means of tackling dynamic and sequential decision-making challenges [
Furthermore, by incorporating deep neural network (DNN), deep reinforcement learning (DRL) algorithms like deep deterministic policy gradient (DDPG) and twin delayed DDPG (TD3) can generate continuous actions and estimate the non-convex value functions. DRL algorithms outperform traditional RL algorithms and mathematical programming in solving optimization problems, offering lower computational burden and better applicability in real-world scenarios [
In the context of scheduling problems of BIESs, DRL algorithms receive available information to make operational decisions. The scheduling is based on day-ahead/hour-ahead predictions of required variables including renewable generation, energy demand, etc. Although some DRL algorithms can learn from the current state to make decisions, there is no explicit forecasting procedure in the design of DRL algorithms, resulting in a poor ability to deal with future uncertainties. Integrating decision-making with forecasting into a holistic operational tool is a natural way to improve the operational efficiency. Recently, some literature has tended to integrate decision-making with forecasting as a holistic data-driven tool for scheduling of integrated energy systems. For instance, [
The efficient scheduling of a BIES with handling non-convexity and uncertainties presents three major challenges. ① Traditional optimization approaches face significant difficulties in solving the operational optimization problem of BIES due to the inherent non-convexity of the devices. Moreover, as the system size increases, these approaches often become computationally prohibitive. ② Existing research on scheduling problems of BIES seldom integrates renewable energy forecasts with decision-making processes using data-driven approaches. Consequently, such comprehensive approaches remain underdeveloped and lack adaptability for specific BIES applications. ③ Many studies employ DRL algorithms in conjunction with black-box forecasting tools, raising concerns about the model transparency and reliability. The opacity of these algorithms can lead to significant profit losses [
To this end, our research addresses these gaps by integrating the TFT for accurate forecast with the SAC algorithm for robust operation. The main contributions of this paper are as follows.
1) This paper presents a detailed decision-making model for BIES, including micro-CHP unit, battery energy storage systems (BESSs), PV panels, and gas boilers (GBs). The non-convex scheduling problem is formulated as an optimization problem and then reformulated into a Markov decision process (MDP) for the application of RL algorithms.
2) This paper proposes a hybrid data-driven approach integrating TFT and SAC algorithm, i.e., TFT-SAC approach, to tackle the non-convex operational optimization problem in BIES. The TFT is used to forecast the renewable generation and energy demand based on historical data, and the obtained forecasts are then utilized by the SAC algorithm to solve the scheduling problems. Unlike conventional black-box forecasting methods, the TFT provides interpretability through the attention mechanism, enhancing the trustworthiness of forecasting results for decision-making. Furthermore, the SAC algorithm, trained to maximize the policy entropy, can learn an operational strategy with superior robustness and generalization capabilities.
3) The proposed TFT-SAC approach is trained and tested on a real-world dataset to validate its superior performance in reducing the energy cost and computational time compared with the benchmark approaches. The generalization performance for the learned scheduling policy and the sensitivity analysis are examined in various scenarios.
A comprehensive comparison between the proposed TFT-SAC approach and other approaches is presented in
Reference | Non-convex model | Forecast model | Forecast explainability | Robustness | Generalization | Computational efficiency | Solution algorithm
---|---|---|---|---|---|---|---
[ | GRU-BLSTM | √ | RSO | ||||
[ | LSTM | RO | |||||
[ | ANN | Deterministic | |||||
[ | √ | √ | TD3 | ||||
[ | √ | CNN-BLSTM | √ | √ | DDPG | ||
[ | LSTM | √ | √ | √ | SAC | ||
This paper | √ | TFT | √ | √ | √ | √ | SAC |
Note: ANN, GRU, and RSO are short for artificial neural network, gated recurrent unit, and robust stochastic optimization, respectively.
The remainder of this paper is organized as follows. Section II covers the system description, device modeling, optimization problem, and MDP. Section III introduces the proposed hybrid data-driven approach integrating TFT and SAC algorithm. Section IV validates the proposed TFT-SAC approach with simulations, and Section V concludes this paper.
This study focuses on a modern BIES that encompasses grid-connected electric systems and independent heating systems, as illustrated in

Fig. 1 Illustration of BIES.
As shown in
Additionally, independent heating systems, consisting of micro-CHP units and GBs, are commonly deployed in building complexes, campuses, and industrial parks, particularly in regions with high heat demands. These localized heating systems reduce the significant transmission losses associated with centralized heating. The BIES model also assumes a connection to an external natural gas market as the fuel source for the micro-CHP units. Detailed models of these devices are provided as follows.
The micro-CHP unit is a crucial component of BIESs, functioning as a single-input multi-output energy converter. It is highly efficient in converting natural gas to power and heat, making it a key element in enhancing the energy efficiency of the BIES. Typically, the micro-CHP unit is modeled with constant energy conversion efficiencies for both power and heat. However, the power and heat generated by the micro-CHP unit are interdependent, resulting in a feasible operating region (FOR). In this paper, we employ a non-convex operational model for the micro-CHP unit. The non-convex FOR of this model is depicted in

Fig. 2 FOR of micro-CHP unit.
The mathematical representation of the FOR for the micro-CHP unit is given by (1), as detailed in [
(1a) |
(1b) |
(1c) |
(1d) |
(1e) |
(1f) |
(1g) |
(1h) |
where and are the output power and heat of micro-CHP unit at time , respectively; and are the generated power and heat of micro-CHP unit at point A, and those at other points B, C, D, E, and F are similarly defined; is a sufficiently large number used to assist in the model description; is the commitment status of the micro-CHP unit; is the set of operational hours; and and are the operating statuses in the convex subregions I and II, respectively. If the micro-CHP unit operates in the convex subregion I, and ; otherwise, and .
The total operation cost of micro-CHP unit at time is expressed as:
(2) |
where , , and are the cost coefficients.
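To make the two-subregion FOR concrete, the following Python sketch checks whether a candidate (power, heat) operating point is feasible by testing membership in either convex subregion with half-plane tests. The corner-point coordinates, the subregion split, and the function names are illustrative placeholders, not the actual data of the unit studied in this paper.

```python
import numpy as np

# Hypothetical corner points (P in kW, H in kW) of the FOR in Fig. 2; the real
# micro-CHP unit uses its own A-F coordinates.
CORNERS = {"A": (25.3, 0.0), "B": (21.0, 18.0), "C": (10.5, 22.0),
           "D": (4.0, 14.0), "E": (4.5, 6.0), "F": (6.0, 0.0)}

# Assumed split of the non-convex FOR into the two convex subregions I and II.
SUBREGION_I = ["A", "B", "C", "F"]
SUBREGION_II = ["C", "D", "E", "F"]

def in_convex_polygon(point, vertex_names):
    """Half-plane test: a point lies in a convex polygon (counter-clockwise
    vertices) iff it is on the left of, or on, every directed edge."""
    p = np.asarray(point, dtype=float)
    verts = [np.asarray(CORNERS[name], dtype=float) for name in vertex_names]
    for v0, v1 in zip(verts, verts[1:] + verts[:1]):
        edge, rel = v1 - v0, p - v0
        if edge[0] * rel[1] - edge[1] * rel[0] < -1e-9:  # strictly to the right of an edge
            return False
    return True

def in_feasible_operating_region(p_chp, h_chp, committed=True):
    """(P, H) is feasible only if the unit is committed and the point falls in
    at least one convex subregion, mirroring the role of constraints (1a)-(1h)."""
    if not committed:
        return p_chp == 0.0 and h_chp == 0.0
    return (in_convex_polygon((p_chp, h_chp), SUBREGION_I)
            or in_convex_polygon((p_chp, h_chp), SUBREGION_II))

print(in_feasible_operating_region(15.0, 10.0))   # True for these placeholder corners
print(in_feasible_operating_region(30.0, 5.0))    # False: beyond the rated output
```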
The BESS is conceptualized as a battery capable of charging and discharging with distinct efficiencies. The operational strategy of BESS is designed with a granularity of one hour, corresponding to one time slot. This means that all charging and discharging activities of BESS within a time period are aggregated into a single operation. Consequently, the BESS can either charge or discharge in any given time slot, but not both simultaneously [
(3a) |
(3b) |
(3c) |
(3d) |
(3e) |
where is the state of charge (SoC) of BESS at time ; and are the predetermined loss factor and charging efficiency, respectively; and are the charging power and discharging power of BESS at time , respectively; and are the charging state and discharging state of BESS at time , respectively; and the subscripts max and min represent the maximum and minimum values of corresponding variables, respectively.
The SoC of BESS is calculated in (3a). The charging power and discharging power of BESS are constrained by (3b) and (3c), respectively. Constraint (3d) is employed to determine the charging or discharging state of BESS. The total capacity of BESS is constrained by (3e).
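A minimal Python sketch of the BESS dynamics and constraints in (3) is given below. The capacity, loss factor, efficiency, and power-limit values are placeholders (a discharging efficiency is assumed in addition to the charging efficiency), and the function name bess_step is hypothetical.

```python
# Illustrative BESS parameters; values and names are placeholders.
CAPACITY_KWH = 122.88          # e.g., 24 x 5.12 kWh LiFePO4 cells
LOSS_FACTOR = 0.001            # hypothetical self-discharge per slot
ETA_CH, ETA_DIS = 0.95, 0.95   # charging/discharging efficiencies (assumed)
P_CH_MAX = P_DIS_MAX = 72.0    # kW
SOC_MIN, SOC_MAX = 0.1, 0.9

def bess_step(soc, p_ch, p_dis, dt_h=1.0):
    """Advance the SoC by one time slot while enforcing the constraints in (3)."""
    # (3d): charging and discharging must not occur simultaneously.
    assert not (p_ch > 0 and p_dis > 0), "simultaneous charge/discharge is infeasible"
    # (3b)-(3c): power limits.
    assert 0.0 <= p_ch <= P_CH_MAX and 0.0 <= p_dis <= P_DIS_MAX
    # (3a): SoC update with self-discharge loss and conversion efficiencies.
    energy = soc * CAPACITY_KWH * (1.0 - LOSS_FACTOR)
    energy += ETA_CH * p_ch * dt_h - p_dis * dt_h / ETA_DIS
    soc_next = energy / CAPACITY_KWH
    # (3e): capacity limits.
    assert SOC_MIN - 1e-6 <= soc_next <= SOC_MAX + 1e-6, "SoC limit violated"
    return soc_next

soc = bess_step(0.5, p_ch=36.0, p_dis=0.0)   # charge for one hour
```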
Considering all the models of devices in BIES presented above, the primary objective of BIES is to minimize the total cost of system operation. Specifically, the operational cost encompasses several components, including the cost of purchasing electricity and gas from the external markets (EMs), the degradation cost of BESSs, and the penalty incurred for unfulfilled energy demand. Consequently, the optimization problem for BIES operator can be formulated as:
(5a) |
(5b) |
(5c) |
(5d) |
where and are the power purchased from the wholesale electricity and natural gas markets, respectively; and are the wholesale electricity and natural gas market prices, respectively; is the power output of the PV panel; and and are the power and heat demands of the BIES, respectively. The set of decision variables is denoted as . The objective function aims to minimize the costs of purchasing energy and operating the devices. The objective is constrained by (1)-(4) and (5b)-(5d), where (1)-(4) are the operating constraints of the micro-CHP unit, BESS, and GB, and (5b)-(5d) enforce the multi-energy balance.
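The per-slot cost structure in (5a) can be illustrated with the short sketch below, which sums the energy purchase cost, micro-CHP operation cost, BESS degradation cost, and unserved-demand penalty. The coefficient values and argument names are hypothetical; purchased power may be negative when electricity is exported.

```python
def step_cost(p_buy, gas_buy, price_e, price_gas, chp_cost,
              bess_throughput, unserved_energy,
              bess_deg_coeff=0.05, penalty_coeff=10.0):
    """Illustrative per-slot cost mirroring the terms in (5a): energy purchase,
    micro-CHP operation, BESS degradation, and an unserved-demand penalty.
    The degradation and penalty coefficients are placeholders; p_buy is
    negative when surplus electricity is exported."""
    purchase = price_e * p_buy + price_gas * gas_buy
    degradation = bess_deg_coeff * bess_throughput
    penalty = penalty_coeff * unserved_energy
    return purchase + chp_cost + degradation + penalty

cost = step_cost(p_buy=20.0, gas_buy=5.0, price_e=0.8, price_gas=3.0,
                 chp_cost=12.0, bess_throughput=36.0, unserved_energy=0.0)
```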
To optimize the decision-making process of the BIES operator, we leverage an MDP to describe the optimization problem. We treat the BIES operator as an intelligent agent whose objective is to improve the operation decisions by minimizing the total cost in (5a). The MDP can be denoted by a tuple , where is the state, which encompasses electricity price , natural gas price , SoC of BESS , forecast of power demand , forecast of heat demand , and forecast of PV generation ; is the action, including the decision variables in (5); is the reward quantifying the agent performance, which is defined as the negative of the objective function in (5a); is the policy of MDP, which contains a series of actions for each state; and is the discount factor that discounts future rewards.
As the main objective of the agent is to identify the optimal policy that maximizes the accumulated return, we evaluate the value of each state using the state value function as given in (6). Moreover, the state-action value function that captures the joint value of a particular action at a state is demonstrated in (7).
$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t} \mid s_{0}=s\right]$ (6)
$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t} \mid s_{0}=s, a_{0}=a\right]$ (7)
where is the expectation function; and and are the initial state and action, respectively.
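The following snippet illustrates the discounted return whose expectation defines (6) and (7); averaging returns over sampled trajectories gives a simple Monte-Carlo estimate of the state value. The 24-step episode length and random rewards are placeholders for the hourly BIES episodes.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_k gamma^k * r_k, whose expectation defines the state value
    in (6) and, conditioned on the first action, the state-action value in (7)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A Monte-Carlo estimate of (6): average the returns of trajectories that
# start from the same state (here, placeholder 24-step reward sequences).
trajectories = [np.random.randn(24) for _ in range(100)]
v_estimate = np.mean([discounted_return(tr) for tr in trajectories])
```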
In this section, we introduce a novel TFT-SAC approach to solve the optimal scheduling problem of BIES. The structure of the proposed TFT-SAC approach is depicted in

Fig. 3 Structure of proposed TFT-SAC approach.
This subsection introduces the TFT model, i.e., an interpretable deep learning model designed for time-series forecasting. The TFT model effectively captures complex temporal relationships and delivers reliable forecasts, which are essential for managing BIES. Specifically, the interpretability of the multi-head self-attention mechanism and the variable selection network (VSN) stems from their ability to assign variable selection weights and attention weights to input data points, thereby visualizing the most influential time steps and features in the prediction process. The detailed algorithm design is covered in the following.
The TFT model generates quantile forecasts, which are particularly useful for estimating the uncertainty of future forecasts. Suppose there are I unique forecasting objects in a given time-series dataset, such as PV power generation, power demand, and heat demand. The quantile forecasts are obtained through a linear transformation of the outputs from the temporal fusion decoder. The mathematical representation of this process is given as:
(8) |
where is the
The training of TFT model involves minimizing the quantile loss [
(9) |
where is the quantile loss of single time series at the average prediction point, is the domain of training data containing samples, and is the weight of TFT model; yt is the actual data; is the prediction data; is the maximum step; and the function can be expressed as:
$QL(y, \hat{y}, q) = q(y-\hat{y})_{+} + (1-q)(\hat{y}-y)_{+}$ (10)
where QL includes predicted values corresponding to different quantiles (e.g., 0.1, 0.5, and 0.9); and $(\cdot)_{+}=\max(0,\cdot)$. To ensure consistency in prediction dimensions across different prediction points, the normalized quantile loss (q-risk) is applied as:
(11) |
where is the domain of test samples; and qrisk is the normalized quantile losses across the entire forecasting horizon.
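A compact PyTorch implementation of the pinball (quantile) loss used in (9)-(10) is sketched below for the quantiles 0.1, 0.5, and 0.9 mentioned above; the tensor shapes are illustrative.

```python
import torch

def quantile_loss(y_true, y_pred, quantiles=(0.1, 0.5, 0.9)):
    """Pinball loss QL(y, y_hat, q) = q*(y - y_hat)_+ + (1 - q)*(y_hat - y)_+,
    summed over quantiles and averaged over samples and forecast horizon.
    y_pred carries one prediction column per quantile."""
    losses = []
    for i, q in enumerate(quantiles):
        err = y_true - y_pred[..., i]
        losses.append(torch.maximum(q * err, (q - 1.0) * err))
    return torch.stack(losses, dim=-1).sum(dim=-1).mean()

y = torch.randn(16, 24)            # batch of 24-step targets
y_hat = torch.randn(16, 24, 3)     # forecasts at quantiles 0.1, 0.5, and 0.9
loss = quantile_loss(y, y_hat)
```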
In the time-series forecast, especially with multiple regression, identifying relevant variables and the extent of non-linear processing is challenging. The TFT model uses gated residual networks (GRNs) for adaptive non-linear processing:
$\mathrm{GRN}_{\omega}(\boldsymbol{a}, \boldsymbol{c}) = \mathrm{LayerNorm}\left(\boldsymbol{a} + \mathrm{GLU}_{\omega}(\boldsymbol{\eta}_{1})\right)$ (12)
$\boldsymbol{\eta}_{1} = \boldsymbol{W}_{1,\omega}\boldsymbol{\eta}_{2} + \boldsymbol{b}_{1,\omega}$ (13)
$\boldsymbol{\eta}_{2} = \mathrm{ELU}\left(\boldsymbol{W}_{2,\omega}\boldsymbol{a} + \boldsymbol{W}_{3,\omega}\boldsymbol{c} + \boldsymbol{b}_{2,\omega}\right)$ (14)
$\mathrm{GLU}_{\omega}(\boldsymbol{\gamma}) = \sigma\left(\boldsymbol{W}_{4,\omega}\boldsymbol{\gamma} + \boldsymbol{b}_{4,\omega}\right) \odot \left(\boldsymbol{W}_{5,\omega}\boldsymbol{\gamma} + \boldsymbol{b}_{5,\omega}\right)$ (15)
where is the layer normalization function; represents the linear and nonlinear contributions, with controlling the degree of nonlinearity, and is the vector of primary inputs to GRN; is an optional context vector; is the activation function of exponential linear unit; is the sigmoid activation function; , , , , and are the weight sharing indices; and , , , and are the bias sharing indices. The GRN layer is controlled by the GLU layer, which may skip the layer entirely if GLU outputs are close to 0.
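The GRN in (12)-(15) can be expressed as a small PyTorch module, sketched below under the assumption of equal input and output dimensions; the class and argument names are placeholders. The GLU gate allows the residual branch to be suppressed (outputs near zero), effectively skipping the nonlinear layer as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualNetwork(nn.Module):
    """Sketch of (12)-(15): ELU nonlinearity, a GLU gate that can suppress the
    whole block, and a residual connection followed by layer normalization."""
    def __init__(self, d_model, d_context=None):
        super().__init__()
        self.w2 = nn.Linear(d_model, d_model)                                       # primary input
        self.w3 = nn.Linear(d_context, d_model, bias=False) if d_context else None  # optional context
        self.w1 = nn.Linear(d_model, d_model)
        self.glu = nn.Linear(d_model, 2 * d_model)                                  # value and gate halves
        self.norm = nn.LayerNorm(d_model)

    def forward(self, a, c=None):
        eta2 = self.w2(a) + (self.w3(c) if self.w3 is not None and c is not None else 0.0)
        eta1 = self.w1(F.elu(eta2))
        value, gate = self.glu(eta1).chunk(2, dim=-1)
        gated = value * torch.sigmoid(gate)      # near-zero gate effectively skips the layer
        return self.norm(a + gated)

grn = GatedResidualNetwork(d_model=32, d_context=16)
out = grn(torch.randn(8, 24, 32), torch.randn(8, 24, 16))
```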
The VSN is a key component of the TFT that improves the performance by selecting important features and filtering out noises. It assigns weights to features, which are used to combine the processed inputs:
(16) |
where is the weight corresponding to features; is the flattened vector; and is obtained from the static covariate encoder. The processed features are weighted by their corresponding variable selection weights and then combined.
The TFT model employs a temporal self-attention layer that plays a key role in capturing long-term dependencies in time-series data. This layer not only improves the model ability to understand complex temporal relationships but also enhances the interpretability of forecasts. The self-attention layer used here is a masked and interpretable multi-head attention layer combined with a gating mechanism to selectively control information flow.
The core concept behind the temporal self-attention layer is to calculate the relevance, or “attention”, of different time steps to each other, enabling the TFT model to focus on important events or sequences within the data. This is done using the following equation for attention:
$\mathrm{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = A(\boldsymbol{Q}, \boldsymbol{K})\boldsymbol{V}$ (17)
where V is the value of input based on the similarity between the query vector and key vector ; and is a normalization function that determines the attention weights of value V. The scaled dot-product mechanism for calculating attention is defined as:
$A(\boldsymbol{Q}, \boldsymbol{K}) = \mathrm{softmax}\left(\boldsymbol{Q}\boldsymbol{K}^{\mathrm{T}} / \sqrt{d_{\mathrm{attn}}}\right)$ (18)
where is the dimension of attention layer.
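A minimal implementation of the scaled dot-product attention in (17) and (18) is shown below; the causal mask used in the decoder part of the TFT is omitted for brevity.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention in (17)-(18): softmax(Q K^T / sqrt(d_attn)) V; returns the
    output and the attention weights over time steps."""
    d_attn = q.shape[-1]
    weights = torch.softmax(q @ k.transpose(-2, -1) / d_attn ** 0.5, dim=-1)
    return weights @ v, weights

q = k = v = torch.randn(24, 32)        # 24 time steps, 32-dimensional embeddings
out, attn = scaled_dot_product_attention(q, k, v)
```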
Multi-head self-attention enhances the power of the self-attention mechanism by allowing the model to jointly focus on information from different representation subspaces at different positions. Instead of using a single set of queries, keys, and values, the multi-head self-attention mechanism splits them into multiple sets, each of which is processed independently. Each head computes attention separately, and the results are then concatenated and linearly transformed to produce the final output. By having multiple heads, the TFT model can capture a richer set of relationships and nuances in the data than a single attention head. The multi-head self-attention mechanism is presented as:
$\mathrm{MultiHead}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = [\boldsymbol{H}_{1}, \boldsymbol{H}_{2}, \dots, \boldsymbol{H}_{m_{H}}]\boldsymbol{W}_{H}$ (19)
$\boldsymbol{H}_{h} = \mathrm{Attention}\left(\boldsymbol{Q}\boldsymbol{W}_{Q}^{(h)}, \boldsymbol{K}\boldsymbol{W}_{K}^{(h)}, \boldsymbol{V}\boldsymbol{W}_{V}^{(h)}\right)$ (20)
where , , and are the head-specific weights for queries, keys, and values, respectively, and and are the dimensions of model and weight, respectively; and linearly combines outputs concatenated from all heads (), and mH is the number of heads.
One of the main issues with traditional multi-head attention mechanism is that each head uses different value vectors, making it difficult to directly determine the feature importance from the attention weights. By modifying the mechanism to share the same value vector across all heads, the TFT model can produce a unified set of attention weights, thereby improving interpretability:
$\mathrm{InterpretableMultiHead}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \tilde{\boldsymbol{H}}\boldsymbol{W}_{H}$ (21)
$\tilde{\boldsymbol{H}} = \frac{1}{m_{H}}\sum_{h=1}^{m_{H}} A\left(\boldsymbol{Q}\boldsymbol{W}_{Q}^{(h)}, \boldsymbol{K}\boldsymbol{W}_{K}^{(h)}\right)\boldsymbol{V}\boldsymbol{W}_{V}$ (22)
where is the interpretable multi-head; denotes the final linear mapping used across ; and is the value weight shared across all heads. Compared with in (18), this modification allows each attention head to share the same set of values , resulting in a single and interpretable set of attention scores that can be analyzed to determine feature importance [
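The interpretable multi-head attention in (21) and (22) differs from the standard formulation in (19) and (20) only in that all heads share one value projection and their attention weights are averaged. A PyTorch sketch is given below; the module and attribute names are placeholders.

```python
import torch
import torch.nn as nn

class InterpretableMultiHead(nn.Module):
    """Sketch of (19)-(22): head-specific query/key projections, one value
    projection shared by all heads, and head-averaged attention weights that
    can be inspected directly for interpretability."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.q_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_heads)])
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_heads)])
        self.v_proj = nn.Linear(d_model, d_model)    # shared value weight
        self.out = nn.Linear(d_model, d_model)       # final linear mapping

    def forward(self, q, k, v):
        d = q.shape[-1]
        v_shared = self.v_proj(v)
        weights = [torch.softmax(wq(q) @ wk(k).transpose(-2, -1) / d ** 0.5, dim=-1)
                   for wq, wk in zip(self.q_proj, self.k_proj)]
        mean_weights = torch.stack(weights).mean(dim=0)   # single interpretable weight map
        return self.out(mean_weights @ v_shared), mean_weights

mha = InterpretableMultiHead(d_model=32, n_heads=4)
out, attn = mha(torch.randn(24, 32), torch.randn(24, 32), torch.randn(24, 32))
```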
In this subsection, we describe the SAC algorithm, which is a state-of-the-art maximum-entropy-based off-policy DRL algorithm, to solve the optimization problem of BIES. Typical DRL algorithms generally suffer from limited robustness in real-world applications due to ineffective exploration [
As a DRL algorithm with an actor-critic structure, the SAC algorithm outperforms most algorithms, e.g., DDPG, in convergence performance. The SAC algorithm maximizes both the cumulative reward and the policy entropy. The entropy function is defined in (23), where is the policy conditioned on the state . The state value function and state-action value function are presented in (24) and (25), respectively, where the temperature parameter determines the relative importance of the entropy term against the reward, and thus controls the stochasticity of the optimal policy [
$\mathcal{H}(\pi(\cdot \mid s_{t})) = \mathbb{E}_{a_{t} \sim \pi}\left[-\log \pi(a_{t} \mid s_{t})\right]$ (23)
(24) |
(25) |
At the same time, the state value function can be presented as (26) according to (23) and (24).
$V(s_{t}) = \mathbb{E}_{a_{t} \sim \pi}\left[Q(s_{t}, a_{t}) - \alpha \log \pi(a_{t} \mid s_{t})\right]$ (26)
(27) |
where , guaranteeing that is a valid probabilistic distribution on the action space; and is the normalization function over all actions in the state . When the Q value converges to the optimum, the optimal policy achieves the optimal state value function. Therefore, the updating of the Q-value function can be realized by using the closed-form solution in an off-policy scheme.
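As an illustration of the soft state value in (26), the snippet below draws reparameterized actions from a Gaussian policy head and averages Q minus the entropy term. The network sizes, the temperature value, and all variable names are placeholders rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn

state_dim, action_dim, alpha = 6, 4, 0.2   # placeholder dimensions and temperature

# Toy critic and Gaussian policy head standing in for the actual networks.
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))

def soft_state_value(state, n_samples=32):
    """Monte-Carlo estimate of (26): V(s) = E_{a~pi}[Q(s, a) - alpha*log pi(a|s)]."""
    mean, log_std = policy_net(state).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    actions = dist.rsample((n_samples,))                    # (n_samples, action_dim)
    log_probs = dist.log_prob(actions).sum(dim=-1)          # joint log-density per sample
    states = state.unsqueeze(0).expand(n_samples, -1)
    q_values = q_net(torch.cat([states, actions], dim=-1)).squeeze(-1)
    return (q_values - alpha * log_probs).mean()

v = soft_state_value(torch.randn(state_dim))
```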
The SAC algorithm adopts an actor-critic structure with DNNs to estimate the policy (actor) and Q-value functions (critic). The actor network is represented by the policy function parameterized by . The critic employs clipped double Q networks and and their target networks and . Therefore, the target for the Q value is expressed as (28). Then, the L2 loss in (29) is used to update each Q-network.
$y_{t} = r_{t} + \gamma\left(\min_{j=1,2} Q_{\hat{\phi}_{j}}(s_{t+1}, \tilde{a}_{t+1}) - \alpha \log \pi_{\theta}(\tilde{a}_{t+1} \mid s_{t+1})\right)$ (28)
$L(\phi_{j}) = \frac{1}{|B|}\sum_{(s_{t}, a_{t}, r_{t}, s_{t+1}) \in B}\left(Q_{\phi_{j}}(s_{t}, a_{t}) - y_{t}\right)^{2}$ (29)
where is the action under the current policy in the next state ; is the set of mini batches indexed by ; and is the executed policy.
To train these networks, the agent randomly samples tuples from the experience replay buffer (ERB) to form a mini batch for experience replay learning. The online critic networks are updated by one step of gradient descent on the mean square error (MSE) in (29), while the actor network is updated by one step of gradient ascent using (30). To stabilize the training, the target network parameters are soft updated with (31).
$\nabla_{\theta} \frac{1}{|B|}\sum_{s_{t} \in B}\left(\min_{j=1,2} Q_{\phi_{j}}\left(s_{t}, \tilde{a}_{\theta}(s_{t})\right) - \alpha \log \pi_{\theta}\left(\tilde{a}_{\theta}(s_{t}) \mid s_{t}\right)\right)$ (30)
$\hat{\phi}_{j} \leftarrow \tau_{\mathrm{soft}} \phi_{j} + \left(1-\tau_{\mathrm{soft}}\right) \hat{\phi}_{j}$ (31)
where is a sample from ; and is the soft update parameter.
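Putting (28)-(31) together, a single SAC update on a sampled mini batch can be sketched as follows. The function assumes externally defined actor and critic networks (the actor exposing a sample() method returning actions and log-probabilities) and uses placeholder hyperparameter values.

```python
import torch

def sac_update(batch, actor, q1, q2, q1_targ, q2_targ, q_optim, actor_optim,
               alpha=0.2, gamma=0.99, tau=0.005):
    """One SAC update on a replay-buffer mini batch, following (28)-(31).
    The actor is assumed to expose sample(s) -> (action, log_prob)."""
    s, a, r, s_next = batch

    # (28): clipped double-Q target with the entropy bonus.
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * (q_next - alpha * logp_next)

    # (29): mean squared error of both critics against the target.
    q_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    q_optim.zero_grad(); q_loss.backward(); q_optim.step()

    # (30): entropy-regularized policy objective (gradient ascent on Q - alpha*log pi).
    a_new, logp_new = actor.sample(s)
    actor_loss = (alpha * logp_new - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    actor_optim.zero_grad(); actor_loss.backward(); actor_optim.step()

    # (31): soft update of the target critics.
    with torch.no_grad():
        for targ, online in ((q1_targ, q1), (q2_targ, q2)):
            for p_t, p in zip(targ.parameters(), online.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)
```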
The use of the proposed TFT-SAC approach is unique and effective for the dynamic operation and control of BIES. This combination offers several advantages but also has potential shortcomings compared with other traditional approaches.
1) Integrated forecasting and operation: the TFT provides accurate and data-driven forecasts of PV generation and energy demand, which allows the SAC algorithm to make informed decisions. This integration reduces uncertainty in the decision-making process, leading to more reliable system operations.
2) Offline training and efficient online operation: the proposed TFT-SAC approach allows for offline training using historical data, enabling the development of a robust policy before deployment. Once trained, the algorithm operates in real time with minimal computational overhead, which is a significant advantage over approaches like SP or RO that require repeated re-optimization.
3) Handling non-convexity: the operation of BIES involves non-convex constraints such as the FOR. The SAC algorithm, leveraging DNNs, can effectively learn non-convex optimal operating policies due to the powerful representation capabilities of DNNs. In comparison, traditional mathematical programming approaches such as mixed-integer linear programming (MILP) address non-convexity by linearizing nonlinear relationships and explicitly formulating integer constraints, facing scalability and computational challenges particularly in large and dynamic systems like BIES. Heuristic algorithms can explore complex optimization landscapes and are often more flexible than mathematical programming. However, they may suffer from high computational demands, especially in large-scale systems, and may converge to local optima rather than finding the global solution.
4) Training complexity: the proposed TFT-SAC approach requires extensive offline training, which can be computationally expensive and time-consuming, particularly for large datasets. The performance relies heavily on a high-quality training dataset, which is typically hard to acquire in the real world.
5) Dependence on forecasting accuracy: the effectiveness of SAC algorithm in making optimal decisions depends heavily on the forecasting accuracy provided by TFT. If the forecasts are inaccurate due to unexpected external factors, the quality of the operational decisions may be compromised.
Overall, the proposed TFT-SAC approach provides an effective solution for BIES operation. The integrated forecasting and optimization structure, the capability to handle non-convexity, and the efficient implementation make this approach a compelling alternative to traditional approaches, despite some challenges related to training complexity and dependence on forecasting accuracy.
To validate the effectiveness of the proposed TFT-SAC approach, we conduct case studies using data from a real building located in Zhenjiang, China. The BIES under study comprises devices like a micro-CHP unit, PV panels, BESSs, and GBs to meet both heat and power demands.
The micro-CHP unit, with a rated output of 25.3 kW, is designed to satisfy the heat demand of the building while partially covering its power demand. The PV system includes 610 PV panels, each with a capacity of 280 W, resulting in a theoretical maximum output of 170.8 kW. However, due to practical limitations, the actual capacity is 153 kW. The BESS consists of 24 LiFePO4 batteries, each with a storage capacity of 5.12 kWh, providing a maximum output of 72 kW. This setup enables the BESS to support peak power demand for up to 4 hours. Detailed information of micro-CHP unit and BESS is shown in Supplementary Material A.
The proposed TFT-SAC approach is implemented in Python, and the neural networks are developed using PyTorch [
Neural network | Number of hidden layers | Number of neurons | Learning rate | Soft update parameter | Optimizer
---|---|---|---|---|---
Actor | 3 | [512, 32] | 1×1 | 11×1 | Adam
Critic | 2 | [512, 32] | 1×1 | 11×1 | Adam
Training parameter | Value
---|---
Replay buffer size | 1×1
Replay start size | 128 |
Batch size | 128 |
Discount factor | 0.99 |
Parameter | Forecast of energy demand | Forecast of PV generation |
---|---|---|
Learning rate | 1×1 | 3.5×1
Grad clip value | 0.1 | 0.9 |
Patience | 10 | 2 |
Batch size | 16 | 16 |
Drop out | 0.2 | 0.1 |
Time step | 168 | 24 |
Hidden size | 128 | 32 |
Number of LSTM layers | 6 | 4 |
Number of attention heads | 6 | 3 |
Loss function | Quantile loss | Quantile loss |
This subsection compares the SAC algorithm with baseline algorithms TD3 and DDPG. Each algorithm is trained for 10000 episodes on sampled days from the training set.

Fig. 4 Episodic reward evolution of different algorithms during offline training process.
To evaluate the performance of the proposed TFT-SAC approach, we use the trained actor network parameters to generate operational strategies for the BIES over 50 test days. We compare this forecasting-combined RL approach with benchmark approaches: typical RL approaches (TD3, DDPG, and SAC) and another forecasting-combined RL approach (LSTM-SAC).

Fig. 5 Cumulative cost for energy consumption with different approaches over 50 test days.
As can be seen from
Forecast object | Model | MAE | RMSE | R² |
---|---|---|---|---|
PV generation | LSTM | 3.66 | 12.23 | 0.8402
PV generation | TFT | 5.22 | 11.24 | 0.8721
Energy demand | LSTM | 3.37 | 4.60 | 0.9407
Energy demand | TFT | 2.20 | 3.26 | 0.9670
Figures

Fig. 6 Performance of LSTM and TFT models in PV generation forecasting.

Fig. 7 Performance of LSTM and TFT models in energy demand forecasting.
The meteorological data include net solar irradiation (NSI), solar irradiation (SI), ultraviolet (UV), outdoor air temperature (OAT), rainfall (RF), relative humidity (RH), temperature-humidity-wind (THW), and surface air temperature (SAT).

Fig. 8 Relative importance of different features in TFT model for forecasting PV generation. (a) Encoder. (b) Decoder.

Fig. 9 Relative importance of different features in TFT model for forecasting energy demand. (a) Encoder. (b) Decoder.
The importance ranking reveals that the TFT model considers both weather conditions and temporal attributes to accurately forecast energy demands. This is crucial because user activities are often influenced by the time of day or specific events on the calendar, and these behavioral patterns significantly affect energy usage in buildings. The model's attention to these aspects shows its ability to learn from diverse data sources and to focus on the most impactful features during the training process, resulting in a more reliable forecast.
Figures

Fig. 10 Attention of TFT model over past 7 days for forecasting PV generation.

Fig. 11 Attention of TFT model over past 7 days for forecasting energy demand.
In comparison, the TFT model for PV generation forecasting focuses on recent time steps due to daily cyclic patterns, while that for forecasting energy demands has a broad attention span over the entire historical cycle, balancing long-term trends and short-term impacts. The gradual increase in attention weights indicates the emphasis on recent information for imminent forecasts.
The uniform attention distribution for energy demand suggests that its cyclical patterns are less pronounced or more complex than those of PV generation. This highlights the importance of extracting information from multiple time scales for accurate forecasts and underscores the need for effective energy management strategies to optimize BIES operational efficiency.
In summary, the TFT model provides accurate and interpretable forecasts for both PV generation and energy demand, supporting the RL algorithm in formulating efficient scheduling strategies.
To validate the generalization performance, different approaches are tested over a test set that shows different statistical characteristics compared with the training set. The test set is represented by several typical weeks labeled W-1 to W-4 for comparative analysis. These typical weeks include scenarios with extreme PV generation or energy demand.
Daily average operational cost (¥):

Week | DDPG | TD3 | SAC | LSTM-SAC | TFT-SAC
---|---|---|---|---|---
W-1 | 500.14 | 499.30 | 490.19 | 328.02 | 325.79 |
W-2 | 361.75 | 361.20 | 347.92 | 232.76 | 231.60 |
W-3 | 450.34 | 449.66 | 431.40 | 318.91 | 311.03 |
W-4 | 733.25 | 732.44 | 715.75 | 521.30 | 520.99 |
To compare the robustness of the proposed TFT-SAC approach with other RL approaches, we introduce independent Gaussian noises to real PV generation and energy demand to represent uncertain scenarios. The average daily operational costs of BIES at different noise levels are presented in
Daily average operational cost (¥):

Noise level | DDPG | TD3 | SAC | LSTM-SAC | TFT-SAC
---|---|---|---|---|---
0.01 | 596.07 | 557.56 | 557.49 | 505.12 | 490.04 |
0.02 | 596.38 | 558.24 | 558.18 | 505.82 | 491.88 |
0.03 | 597.37 | 559.02 | 558.96 | 506.62 | 494.91 |
0.04 | 599.80 | 559.85 | 559.78 | 507.47 | 495.13 |
0.05 | 603.91 | 560.73 | 560.66 | 508.38 | 495.17 |
Across all noise levels, the typical RL approaches incur significantly higher operational costs than the forecasting-combined RL approaches, with cost differences ranging from ¥60 to ¥100. Among all the tested approaches, the proposed TFT-SAC approach demonstrates the lowest average operational costs, indicating superior robustness. The daily cost gap between the proposed TFT-SAC approach and LSTM-SAC remains moderate, at roughly ¥12-¥15 across the noise levels. In contrast, as the noise level rises from 0.01 to 0.05, the cost of the proposed TFT-SAC approach increases by approximately ¥5, whereas those of TD3, SAC, and LSTM-SAC increase by only about ¥3. This larger cost variation suggests that the proposed TFT-SAC approach is more sensitive to forecasting accuracy than the other approaches, even though it consistently achieves the lowest average operational costs among all approaches.
To evaluate the generalization of the optimal energy management policy learned by the proposed TFT-SAC approach, we apply two typical scenarios: a summer day (August 27) and a winter day (December 25). Figures

Fig. 12 Power generation and consumption of BIES. (a) A summer day. (b) A winter day.

Fig. 13 Heat generation and consumption of BIES. (a) A summer day. (b) A winter day.
Both scenarios share common trends. Initially, from 00:00 to 08:00, the BIES purchases electricity due to zero PV generation and a low SoC of the BESS. The BESS charges at low prices for future demands. From 09:00 to 15:00, PV generation and BESS discharging can meet most of the power demand, with excess power sold at high electricity prices. From 18:00 to 24:00, the BIES does not sell electricity, and the micro-CHP unit becomes the primary power source due to high demand.
Nevertheless, there are some evident differences between the two typical days. On the winter day, the micro-CHP unit operates from 09:00 to 15:00 to meet high heat demands and support the power demands due to low PV generation. On the summer day, the micro-CHP unit is inactive as PV and BESS can meet the demands and the excess power is sold. The policy effectively uses micro-CHP unit in winter and BESS in summer, charging at low prices and discharging at high prices to maximize the economic benefits.
Finally, it can be concluded that the proposed TFT-SAC approach can learn an effective policy and can generalize to variable state information on different test days. Also, the flexibility of the BIES is investigated on the two typical winter and summer days. Specifically, the summer day has higher PV generation and lower heat demand, so it exports more energy and exploits more of the flexibility of the BIES. Due to the lower PV generation and higher heat demand, the winter day has a higher power import and a higher utilization of the micro-CHP unit, which also provides substantial flexibility to the BIES.
In this subsection, a detailed sensitivity analysis is conducted to evaluate the impact of changes in key factors on the operation and performance of BIES. Specifically, we analyze the sensitivities of the episodic reward to variations in electricity price, PV generation, power demand, and heat demand, as shown in

Fig. 14 Sensitivity analysis on several factors.
The sensitivity analysis is performed by varying each parameter independently from 90% to 110% of the initial configured value, with a granularity of 5%. This range is selected to represent potential fluctuations in market and operational conditions, and the granularity is chosen to provide a balanced level of detail without excessive computational overhead.
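The sweep described above can be reproduced with a short script such as the one below, where evaluate_policy() is a placeholder standing in for a rollout of the trained TFT-SAC policy over the test days with the selected factor rescaled.

```python
import numpy as np

def evaluate_policy(factor, scale):
    """Placeholder: roll out the trained TFT-SAC policy on the test days with
    the named input profile multiplied by `scale` and return the episodic
    reward; a dummy value is returned here so the sweep is runnable."""
    rng = np.random.default_rng(abs(hash((factor, round(scale, 2)))) % 2**32)
    return float(rng.normal(-500.0, 10.0))

# Scale each factor from 90% to 110% of its nominal profile in 5% steps.
scales = np.round(np.arange(0.90, 1.101, 0.05), 2)
factors = ["electricity_price", "pv_generation", "power_demand", "heat_demand"]
sensitivity = {f: [evaluate_policy(f, s) for s in scales] for f in factors}
```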
The results in
Interestingly, the power demand has a greater effect on the episodic reward compared with PV generation. This is because the total daily PV generation is lower than the total power demand. As a result, any decrease in power demand has a larger marginal impact on profitability, either through reduced procurement or allowing more energy to be sold during peak periods.
In terms of scheduling policies, the changes in power demand and PV generation lead to noticeable shifts in action prioritization. For instance, the increased PV generation results in more frequent utilization of BESS for energy arbitrage, while fluctuations in electricity price affect decisions regarding energy procurement timing. These findings emphasize the importance of accurate forecasts for PV generation and energy demand in effectively optimizing the operational strategies of BIES.
In conclusion, this paper develops a novel hybrid data-driven approach, i.e., the TFT-SAC approach, for optimal scheduling in BIES. Specifically, the TFT model improves the forecasting accuracy and transparency through the attention mechanism and the VSN, making the forecasting results interpretable and trustworthy. The integration of the SAC algorithm for optimization further strengthens this framework by ensuring more effective exploration during training, leading to stronger robustness and generalization capabilities. Simulation results demonstrate the superior performance of the proposed TFT-SAC approach compared with the existing approaches. The interpretability of the TFT model and the generalization performance of the SAC algorithm are analyzed. A sensitivity analysis of the reward with respect to several key factors in the BIES is also conducted.
References
X. Cao, X. Dai, and J. Liu, “Building energy-consumption status worldwide and the state-of-the-art technologies for zero-energy buildings during the past decade,” Energy and Buildings, vol. 128, pp. 198-213, Sept. 2016.
W. Wu, P. Li, B. Wang et al., “Integrated distribution management system: architecture, functions, and application in China,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 2, pp. 245-258, Mar. 2022.
H. Qiu, V. Veerasamy, C. Ning et al., “Two-stage robust optimization for assessment of PV hosting capacity based on decision-dependent uncertainty,” Journal of Modern Power Systems and Clean Energy, vol. 12, no. 6, pp. 2091-2096, Nov. 2024.
X. Huang, Z. Xu, Y. Sun et al., “Heat and power load dispatching considering energy storage of district heating system and electric boilers,” Journal of Modern Power Systems and Clean Energy, vol. 6, no. 5, pp. 992-1003, Nov. 2018.
C. Huang, H. Zhang, L. Wang et al., “Mixed deep reinforcement learning considering discrete-continuous hybrid action space for smart home energy management,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 3, pp. 743-754, May 2022.
H. Zhao, B. Wang, X. Wang et al., “Active dynamic aggregation model for distributed integrated energy system as virtual power plant,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 5, pp. 831-840, Sept. 2020.
M. Sechilariu, B. Wang, and F. Locment, “Building integrated photovoltaic system with energy storage and smart grid communication,” IEEE Transactions on Industrial Electronics, vol. 60, no. 4, pp. 1607-1618, Apr. 2013.
Y. Li, C. Wang, G. Li et al., “Improving operational flexibility of integrated energy system with uncertain renewable generations considering thermal inertia of buildings,” Energy Conversion and Management, vol. 207, p. 112526, Mar. 2020.
R. Jing, M. Wang, Z. Zhang et al., “Comparative study of posteriori decision-making methods when designing building integrated energy systems with multi-objectives,” Energy and Buildings, vol. 194, pp. 123-139, Jul. 2019.
Y. Zhang, P. E. Campana, A. Lundblad et al., “Planning and operation of an integrated energy system in a Swedish building,” Energy Conversion and Management, vol. 199, p. 111920, Nov. 2019.
Z. Zhu, Z. Hu, K. W. Chan et al., “Reinforcement learning in deregulated energy market: a comprehensive review,” Applied Energy, vol. 329, p. 120212, Jan. 2023.
A. Dolatabadi, H. Abdeltawab, and Y. A. I. Mohamed, “A novel model-free deep reinforcement learning framework for energy management of a PV integrated energy hub,” IEEE Transactions on Power Systems, vol. 38, no. 5, pp. 4840-4852, Sept. 2023.
D. Qiu, Z. Dong, X. Zhang et al., “Safe reinforcement learning for real-time automatic control in a smart energy-hub,” Applied Energy, vol. 309, p. 118403, Mar. 2022.
Z. Zhu, K. W. Chan, S. Xia et al., “Optimal bi-level bidding and dispatching strategy between active distribution network and virtual alliances using distributed robust multi-agent deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 13, no. 4, pp. 2833-2843, Jul. 2022.
Y. Zhou, Z. Ma, J. Zhang et al., “Data-driven stochastic energy management of multi energy system using deep reinforcement learning,” Energy, vol. 261, p. 125187, Dec. 2022.
Z. Hu, K. W. Chan, Z. Zhu et al., “Techno-economic modeling and safe operational optimization of multi-network constrained integrated community energy systems,” Advances in Applied Energy, vol. 15, p. 100183, Sept. 2024.
Y. Zhou, B. Zhang, C. Xu et al., “A data-driven method for fast AC optimal power flow solutions via deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1128-1139, Nov. 2020.
D. Cao, W. Hu, X. Xu et al., “Deep reinforcement learning based approach for optimal power flow of distribution networks embedded with renewable energy and storage devices,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 5, pp. 1101-1110, Sept. 2021.
Q. Ma and C. Deng, “Simplified deep reinforcement learning based volt-var control of topologically variable power system,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 5, pp. 1396-1404, Sept. 2023.
Y. Wang, M. Mao, L. Chang et al., “Intelligent voltage control method in active distribution networks based on averaged weighted double deep Q-network algorithm,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 1, pp. 132-143, Jan. 2023.
B. Lim, S. Ö. Arık, N. Loeff et al., “Temporal fusion transformers for interpretable multi-horizon time series forecasting,” International Journal of Forecasting, vol. 37, no. 4, pp. 1748-1764, Oct. 2021.
W. J. von Eschenbach, “Transparency and the black box problem: why we do not trust AI,” Philosophy & Technology, vol. 34, no. 4, pp. 1607-1622, Sept. 2021.
T. M. Alabi, L. Lu, and Z. Yang, “Data-driven optimal scheduling of multi-energy system virtual power plant (MEVPP) incorporating carbon capture system (CCS), electric vehicle flexibility, and clean energy marketer (CEM) strategy,” Applied Energy, vol. 314, p. 118997, May 2022.
S. Zhou, D. He, Z. Zhang et al., “A data-driven scheduling approach for hydrogen penetrated energy system using LSTM network,” Sustainability, vol. 11, no. 23, p. 6784, Dec. 2019.
A. Kämper, R. Delorme, L. Leenders et al., “Boosting operational optimization of multi-energy systems by artificial neural nets,” Computers & Chemical Engineering, vol. 173, p. 108208, May 2023.
Y. Xu, W. Gao, Y. Li et al., “Operational optimization for the grid-connected residential photovoltaic-battery system using model-based reinforcement learning,” Journal of Building Engineering, vol. 73, p. 106774, Aug. 2023.
G. Pan, W. Gu, Y. Lu et al., “Optimal planning for electricity-hydrogen integrated energy system considering power to hydrogen and heat and seasonal storage,” IEEE Transactions on Sustainable Energy, vol. 11, no. 4, pp. 2662-2676, Oct. 2020.
R. Wen, K. Torkkola, B. Narayanaswamy et al. (2017, Nov.). A multi-horizon quantile recurrent forecaster. [Online]. Available: https://arxiv.org/abs/1711.11053
A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, pp. 1-10, Aug. 2017.
T. Haarnoja, A. Zhou, K. Hartikainen et al. (2018, Jan.). Soft actor-critic algorithms and applications. [Online]. Available: https://arxiv.org/abs/1812.05905
A. Paszke, S. Gross, F. Massa et al., “PyTorch: an imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, pp. 1-12, Dec. 2019.