Abstract
Building integrated energy systems (BIESs) account for a significant proportion of global energy consumption, making them pivotal for enhancing energy efficiency. Two key barriers to efficient BIES operation are the uncertainty of renewable generation and the operational non-convexity of combined heat and power (CHP) units. To this end, this paper proposes a soft actor-critic (SAC) algorithm to solve the scheduling problem of BIES, which overcomes the model non-convexity and shows advantages in robustness and generalization. This paper also adopts a temporal fusion transformer (TFT) to enhance the solution quality of the SAC algorithm by forecasting the renewable generation and energy demand. The TFT can effectively capture complex temporal patterns and dependencies that span multiple steps. Furthermore, its forecasting results are interpretable owing to the self-attention layer, which supports more trustworthy decision-making in the SAC algorithm. The proposed hybrid data-driven approach integrating the TFT and SAC algorithm, i.e., the TFT-SAC approach, is trained and tested on a real-world dataset to validate its superior performance in reducing the energy cost and computational time compared with benchmark approaches. The generalization performance of the scheduling policy, as well as a sensitivity analysis, is examined in the case studies.
THE rapid development in industry and urban areas has led to significant changes in energy systems, resulting in high renewable penetration and challenges for sustainable development. With buildings accounting for about 40% of global energy consumption, it is crucial to enhance the efficiency of building energy systems for meeting rising energy demands and supporting sustainability [
However, the optimal operation of BIES is hindered by two key challenges: ① the high operational risk due to the intermittent and uncertain nature of photovoltaic (PV) generation and energy demand [
BIESs have been extensively studied, particularly in the areas of scheduling [
While these conventional approaches are effective in managing the scheduling of multi-carrier energy systems, they face challenges in handling highly nonlinear units, particularly in competitive markets. Stochastic programming (SP) becomes inefficient as the number of scenarios increases, and RO often yields overly conservative results by focusing on the worst-case scenarios. Besides, both SP and RO suffer from the curse of dimensionality, where the increased actions, decision variables, and constraints lead to exponentially growing computational requirements, limiting their scalability for real-world applications involving multiple devices and uncertainties [
Reinforcement learning (RL) presents an innovative alternative that effectively addresses the above limitations by providing a means of tackling dynamic and sequential decision-making challenges [
Furthermore, by incorporating deep neural network (DNN), deep reinforcement learning (DRL) algorithms like deep deterministic policy gradient (DDPG) and twin delayed DDPG (TD3) can generate continuous actions and estimate the non-convex value functions. DRL algorithms outperform traditional RL algorithms and mathematical programming in solving optimization problems, offering lower computational burden and better applicability in real-world scenarios [
In the context of scheduling problems of BIESs, DRL algorithms receive available information to make operational decisions. The scheduling is based on day-ahead/hour-ahead predictions of required variables including renewable generation, energy demand, etc. Although some DRL algorithms can learn from the current state to make decisions, there is no explicit forecasting procedure in the design of DRL algorithms, resulting in a poor ability to deal with future uncertainties. Integrating decision-making with forecasting into a holistic operational tool is a natural way to improve the operational efficiency. Recently, some literature has tended to integrate decision-making with forecasting as a holistic data-driven tool for scheduling of integrated energy systems. For instance, [
The efficient scheduling of a BIES with handling non-convexity and uncertainties presents three major challenges. ① Traditional optimization approaches face significant difficulties in solving the operational optimization problem of BIES due to the inherent non-convexity of the devices. Moreover, as the system size increases, these approaches often become computationally prohibitive. ② Existing research on scheduling problems of BIES seldom integrates renewable energy forecasts with decision-making processes using data-driven approaches. Consequently, such comprehensive approaches remain underdeveloped and lack adaptability for specific BIES applications. ③ Many studies employ DRL algorithms in conjunction with black-box forecasting tools, raising concerns about the model transparency and reliability. The opacity of these algorithms can lead to significant profit losses [
To this end, our research addresses these gaps by integrating the TFT for accurate forecast with the SAC algorithm for robust operation. The main contributions of this paper are as follows.
1) This paper presents a detailed decision-making model for BIES, including micro-CHP unit, battery energy storage systems (BESSs), PV panels, and gas boilers (GBs). The non-convex scheduling problem is formulated as an optimization problem and then reformulated into a Markov decision process (MDP) for the application of RL algorithms.
2) This paper proposes a hybrid data-driven approach integrating TFT and SAC algorithm, i.e., TFT-SAC approach, to tackle the non-convex operational optimization problem in BIES. The TFT is used to forecast the renewable generation and energy demand based on historical data, and the obtained forecasts are then utilized by the SAC algorithm to solve the scheduling problems. Unlike conventional black-box forecasting methods, the TFT provides interpretability through the attention mechanism, enhancing the trustworthiness of forecasting results for decision-making. Furthermore, the SAC algorithm, trained to maximize the policy entropy, can learn an operational strategy with superior robustness and generalization capabilities.
3) The proposed TFT-SAC approach is trained and tested on a real-world dataset to validate its superior performance in reducing the energy cost and computational time compared with the benchmark approaches. The generalization performance for the learned scheduling policy and the sensitivity analysis are examined in various scenarios.
A comprehensive comparison between the proposed TFT-SAC approach and other approaches is presented in
Reference | Non-convex model | Forecast model | Forecast explainability | Robustness | Generalization | Computational efficiency | Solution algorithm
---|---|---|---|---|---|---|---
[ | GRU-BLSTM | √ | RSO | ||||
[ | LSTM | RO | |||||
[ | ANN | Deterministic | |||||
[ | √ | √ | TD3 | ||||
[ | √ | CNN-BLSTM | √ | √ | DDPG | ||
[ | LSTM | √ | √ | √ | SAC | ||
This paper | √ | TFT | √ | √ | √ | √ | SAC |
Note: ANN, GRU, and RSO are short for artificial neural network, gated recurrent unit, and robust stochastic optimization, respectively.
The remainder of this paper is organized as follows. Section II covers the system description, device modeling, optimization problem, and MDP. Section III introduces the proposed hybrid data-driven approach integrating TFT and SAC algorithm. Section IV validates the proposed TFT-SAC approach with simulations, and Section V concludes this paper.
This study focuses on a modern BIES that encompasses grid-connected electric systems and independent heating systems, as illustrated in

Fig. 1 Illustration of BIES.
As shown in
Additionally, independent heating systems, consisting of micro-CHP units and GBs, are commonly deployed in building complexes, campuses, and industrial parks, particularly in regions with high heat demands. These localized heating systems reduce the significant transmission losses associated with centralized heating. The BIES model also assumes a connection to an external natural gas market as the fuel source for the micro-CHP units. Detailed models of these devices are provided as follows.
The micro-CHP unit is a crucial component of BIESs, functioning as a single-input multi-output energy converter. It is highly efficient in converting natural gas to power and heat, making it a key element in enhancing the energy efficiency of the BIES. Typically, the micro-CHP unit is modeled with constant energy conversion efficiencies for both power and heat. However, the power and heat generated by the micro-CHP unit are interdependent, resulting in a feasible operating region (FOR). In this paper, we employ a non-convex operational model for the micro-CHP unit. The non-convex FOR of this model is depicted in

Fig. 2 FOR of micro-CHP unit.
The mathematical representation of the FOR for the micro-CHP unit is given by (1), as detailed in [
(1a) |
(1b) |
(1c) |
(1d) |
(1e) |
(1f) |
(1g) |
(1h) |
where and are the output power and heat of micro-CHP unit at time , respectively; and are the generated power and heat of micro-CHP unit at point A, and those at other points B, C, D, E, and F are similarly defined; is a sufficiently large number used to assist in the model description; is the commitment status of the micro-CHP unit; is the set of operational hours; and and are the operating statuses in the convex subregions I and II, respectively. If the micro-CHP unit operates in the convex subregion I, and ; otherwise, and .
The total operation cost of micro-CHP unit at time is expressed as:
(2) |
where , , and are the cost coefficients.
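To make the two-subregion FOR concrete, the following Python sketch checks whether a candidate (power, heat) operating point is feasible by testing membership in either convex subregion with half-plane tests. The corner-point coordinates, the subregion split, and the function names are illustrative placeholders, not the actual data of the unit studied in this paper.

```python
import numpy as np

# Hypothetical corner points (P in kW, H in kW) of the FOR in Fig. 2; the real
# micro-CHP unit uses its own A-F coordinates.
CORNERS = {"A": (25.3, 0.0), "B": (21.0, 18.0), "C": (10.5, 22.0),
           "D": (4.0, 14.0), "E": (4.5, 6.0), "F": (6.0, 0.0)}

# Assumed split of the non-convex FOR into the two convex subregions I and II.
SUBREGION_I = ["A", "B", "C", "F"]
SUBREGION_II = ["C", "D", "E", "F"]

def in_convex_polygon(point, vertex_names):
    """Half-plane test: a point lies in a convex polygon (counter-clockwise
    vertices) iff it is on the left of, or on, every directed edge."""
    p = np.asarray(point, dtype=float)
    verts = [np.asarray(CORNERS[name], dtype=float) for name in vertex_names]
    for v0, v1 in zip(verts, verts[1:] + verts[:1]):
        edge, rel = v1 - v0, p - v0
        if edge[0] * rel[1] - edge[1] * rel[0] < -1e-9:  # strictly to the right of an edge
            return False
    return True

def in_feasible_operating_region(p_chp, h_chp, committed=True):
    """(P, H) is feasible only if the unit is committed and the point falls in
    at least one convex subregion, mirroring the role of constraints (1a)-(1h)."""
    if not committed:
        return p_chp == 0.0 and h_chp == 0.0
    return (in_convex_polygon((p_chp, h_chp), SUBREGION_I)
            or in_convex_polygon((p_chp, h_chp), SUBREGION_II))

print(in_feasible_operating_region(15.0, 10.0))   # True for these placeholder corners
print(in_feasible_operating_region(30.0, 5.0))    # False: beyond the rated output
```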
The BESS is conceptualized as a battery capable of charging and discharging with distinct efficiencies. The operational strategy of BESS is designed with a granularity of one hour, corresponding to one time slot. This means that all charging and discharging activities of BESS within a time period are aggregated into a single operation. Consequently, the BESS can either charge or discharge in any given time slot, but not both simultaneously [
(3a) |
(3b) |
(3c) |
(3d) |
(3e) |
where is the state of charge (SoC) of BESS at time ; and are the predetermined loss factor and charging efficiency, respectively; and are the charging power and discharging power of BESS at time , respectively; and are the charging state and discharging state of BESS at time , respectively; and the subscripts max and min represent the maximum and minimum values of corresponding variables, respectively.
The SoC of BESS is calculated in (3a). The charging power and discharging power of BESS are constrained by (3b) and (3c), respectively. Constraint (3d) is employed to determine the charging or discharging state of BESS. The total capacity of BESS is constrained by (3e).
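A minimal Python sketch of the BESS dynamics and constraints in (3) is given below. The capacity, loss factor, efficiency, and power-limit values are placeholders (a discharging efficiency is assumed in addition to the charging efficiency), and the function name bess_step is hypothetical.

```python
# Illustrative BESS parameters; values and names are placeholders.
CAPACITY_KWH = 122.88          # e.g., 24 x 5.12 kWh LiFePO4 cells
LOSS_FACTOR = 0.001            # hypothetical self-discharge per slot
ETA_CH, ETA_DIS = 0.95, 0.95   # charging/discharging efficiencies (assumed)
P_CH_MAX = P_DIS_MAX = 72.0    # kW
SOC_MIN, SOC_MAX = 0.1, 0.9

def bess_step(soc, p_ch, p_dis, dt_h=1.0):
    """Advance the SoC by one time slot while enforcing the constraints in (3)."""
    # (3d): charging and discharging must not occur simultaneously.
    assert not (p_ch > 0 and p_dis > 0), "simultaneous charge/discharge is infeasible"
    # (3b)-(3c): power limits.
    assert 0.0 <= p_ch <= P_CH_MAX and 0.0 <= p_dis <= P_DIS_MAX
    # (3a): SoC update with self-discharge loss and conversion efficiencies.
    energy = soc * CAPACITY_KWH * (1.0 - LOSS_FACTOR)
    energy += ETA_CH * p_ch * dt_h - p_dis * dt_h / ETA_DIS
    soc_next = energy / CAPACITY_KWH
    # (3e): capacity limits.
    assert SOC_MIN - 1e-6 <= soc_next <= SOC_MAX + 1e-6, "SoC limit violated"
    return soc_next

soc = bess_step(0.5, p_ch=36.0, p_dis=0.0)   # charge for one hour
```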
Considering all the models of devices in BIES presented above, the primary objective of BIES is to minimize the total cost of system operation. Specifically, the operational cost encompasses several components, including the cost of purchasing electricity and gas from the external markets (EMs), the degradation cost of BESSs, and the penalty incurred for unfulfilled energy demand. Consequently, the optimization problem for BIES operator can be formulated as:
(5a) |
(5b) |
(5c) |
(5d) |
where and are the power purchased from the wholesale electricity and natural gas markets, respectively; and are the wholesale electricity and natural gas market prices, respectively; is the power output of the PV panel; and and are the power and heat demands of the BIES, respectively. The set of decision variables is denoted as . The objective function aims to minimize the costs of purchasing energy and operating the devices. The objective is constrained by (1)-(4) and (5b)-(5d), where (1)-(4) are the operating constraints of the micro-CHP unit, BESS, and GB, and (5b)-(5d) enforce the multi-energy balance.
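The per-slot cost structure in (5a) can be illustrated with the short sketch below, which sums the energy purchase cost, micro-CHP operation cost, BESS degradation cost, and unserved-demand penalty. The coefficient values and argument names are hypothetical; purchased power may be negative when electricity is exported.

```python
def step_cost(p_buy, gas_buy, price_e, price_gas, chp_cost,
              bess_throughput, unserved_energy,
              bess_deg_coeff=0.05, penalty_coeff=10.0):
    """Illustrative per-slot cost mirroring the terms in (5a): energy purchase,
    micro-CHP operation, BESS degradation, and an unserved-demand penalty.
    The degradation and penalty coefficients are placeholders; p_buy is
    negative when surplus electricity is exported."""
    purchase = price_e * p_buy + price_gas * gas_buy
    degradation = bess_deg_coeff * bess_throughput
    penalty = penalty_coeff * unserved_energy
    return purchase + chp_cost + degradation + penalty

cost = step_cost(p_buy=20.0, gas_buy=5.0, price_e=0.8, price_gas=3.0,
                 chp_cost=12.0, bess_throughput=36.0, unserved_energy=0.0)
```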
To optimize the decision-making process of the BIES operator, we leverage an MDP to describe the optimization problem. We treat the BIES operator as an intelligent agent whose objective is to improve the operation decisions by minimizing the total cost in (5a). The MDP can be denoted by a tuple , where is the state, which encompasses electricity price , natural gas price , SoC of BESS , forecast of power demand , forecast of heat demand , and forecast of PV generation ; is the action, including the decision variables in (5); is the reward quantifying the agent performance, which is defined as the negative of the objective function in (5a); is the policy of MDP, which contains a series of actions for each state; and is the discount factor that discounts future rewards.
As the main objective of the agent is to identify the optimal policy that maximizes the accumulated return, we evaluate the value of each state using the state value function as given in (6). Moreover, the state-action value function that captures the joint value of a particular action at a state is demonstrated in (7).
$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t} \mid s_{0}=s\right]$ (6)
$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t} \mid s_{0}=s, a_{0}=a\right]$ (7)
where is the expectation function; and and are the initial state and action, respectively.
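The following snippet illustrates the discounted return whose expectation defines (6) and (7); averaging returns over sampled trajectories gives a simple Monte-Carlo estimate of the state value. The 24-step episode length and random rewards are placeholders for the hourly BIES episodes.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_k gamma^k * r_k, whose expectation defines the state value
    in (6) and, conditioned on the first action, the state-action value in (7)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A Monte-Carlo estimate of (6): average the returns of trajectories that
# start from the same state (here, placeholder 24-step reward sequences).
trajectories = [np.random.randn(24) for _ in range(100)]
v_estimate = np.mean([discounted_return(tr) for tr in trajectories])
```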
In this section, we introduce a novel TFT-SAC approach to solve the optimal scheduling problem of BIES. The structure of the proposed TFT-SAC approach is depicted in

Fig. 3 Structure of proposed TFT-SAC approach.
This subsection introduces the TFT model, i.e., an interpretable deep learning model designed for time-series forecasting. The TFT model effectively captures complex temporal relationships and delivers reliable forecasts, which are essential for managing BIES. Specifically, the interpretability of the multi-head self-attention mechanism and the variable selection network (VSN) stems from their ability to assign variable selection weights and attention weights to input data points, thereby visualizing the most influential time steps and features in the prediction process. The detailed algorithm design is covered in the following.
The TFT model generates quantile forecasts, which are particularly useful for estimating the uncertainty of future forecasts. Suppose there are I unique forecasting objects in a given time-series dataset, such as PV power generation, power demand, and heat demand. The quantile forecasts are obtained through a linear transformation of the outputs from the temporal fusion decoder. The mathematical representation of this process is given as:
(8) |
where is the
The training of TFT model involves minimizing the quantile loss [
(9) |
where is the quantile loss of single time series at the average prediction point, is the domain of training data containing samples, and is the weight of TFT model; yt is the actual data; is the prediction data; is the maximum step; and the function can be expressed as:
$QL(y, \hat{y}, q) = q(y-\hat{y})_{+} + (1-q)(\hat{y}-y)_{+}$ (10)
where QL includes predicted values corresponding to different quantiles (e.g., 0.1, 0.5, and 0.9); and $(\cdot)_{+}=\max(0,\cdot)$. To ensure consistency in prediction dimensions across different prediction points, the normalized quantile loss (q-risk) is applied as:
(11) |
where is the domain of test samples; and qrisk is the normalized quantile losses across the entire forecasting horizon.
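A compact PyTorch implementation of the pinball (quantile) loss used in (9)-(10) is sketched below for the quantiles 0.1, 0.5, and 0.9 mentioned above; the tensor shapes are illustrative.

```python
import torch

def quantile_loss(y_true, y_pred, quantiles=(0.1, 0.5, 0.9)):
    """Pinball loss QL(y, y_hat, q) = q*(y - y_hat)_+ + (1 - q)*(y_hat - y)_+,
    summed over quantiles and averaged over samples and forecast horizon.
    y_pred carries one prediction column per quantile."""
    losses = []
    for i, q in enumerate(quantiles):
        err = y_true - y_pred[..., i]
        losses.append(torch.maximum(q * err, (q - 1.0) * err))
    return torch.stack(losses, dim=-1).sum(dim=-1).mean()

y = torch.randn(16, 24)            # batch of 24-step targets
y_hat = torch.randn(16, 24, 3)     # forecasts at quantiles 0.1, 0.5, and 0.9
loss = quantile_loss(y, y_hat)
```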
In the time-series forecast, especially with multiple regression, identifying relevant variables and the extent of non-linear processing is challenging. The TFT model uses gated residual networks (GRNs) for adaptive non-linear processing:
$\mathrm{GRN}_{\omega}(\boldsymbol{a}, \boldsymbol{c}) = \mathrm{LayerNorm}\left(\boldsymbol{a} + \mathrm{GLU}_{\omega}(\boldsymbol{\eta}_{1})\right)$ (12)
$\boldsymbol{\eta}_{1} = \boldsymbol{W}_{1,\omega}\boldsymbol{\eta}_{2} + \boldsymbol{b}_{1,\omega}$ (13)
$\boldsymbol{\eta}_{2} = \mathrm{ELU}\left(\boldsymbol{W}_{2,\omega}\boldsymbol{a} + \boldsymbol{W}_{3,\omega}\boldsymbol{c} + \boldsymbol{b}_{2,\omega}\right)$ (14)
$\mathrm{GLU}_{\omega}(\boldsymbol{\gamma}) = \sigma\left(\boldsymbol{W}_{4,\omega}\boldsymbol{\gamma} + \boldsymbol{b}_{4,\omega}\right) \odot \left(\boldsymbol{W}_{5,\omega}\boldsymbol{\gamma} + \boldsymbol{b}_{5,\omega}\right)$ (15)
where is the layer normalization function; represents the linear and nonlinear contributions, with controlling the degree of nonlinearity, and is the vector of primary inputs to GRN; is an optional context vector; is the activation function of exponential linear unit; is the sigmoid activation function; , , , , and are the weight sharing indices; and , , , and are the bias sharing indices. The GRN layer is controlled by the GLU layer, which may skip the layer entirely if GLU outputs are close to 0.
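The GRN in (12)-(15) can be expressed as a small PyTorch module, sketched below under the assumption of equal input and output dimensions; the class and argument names are placeholders. The GLU gate allows the residual branch to be suppressed (outputs near zero), effectively skipping the nonlinear layer as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualNetwork(nn.Module):
    """Sketch of (12)-(15): ELU nonlinearity, a GLU gate that can suppress the
    whole block, and a residual connection followed by layer normalization."""
    def __init__(self, d_model, d_context=None):
        super().__init__()
        self.w2 = nn.Linear(d_model, d_model)                                       # primary input
        self.w3 = nn.Linear(d_context, d_model, bias=False) if d_context else None  # optional context
        self.w1 = nn.Linear(d_model, d_model)
        self.glu = nn.Linear(d_model, 2 * d_model)                                  # value and gate halves
        self.norm = nn.LayerNorm(d_model)

    def forward(self, a, c=None):
        eta2 = self.w2(a) + (self.w3(c) if self.w3 is not None and c is not None else 0.0)
        eta1 = self.w1(F.elu(eta2))
        value, gate = self.glu(eta1).chunk(2, dim=-1)
        gated = value * torch.sigmoid(gate)      # near-zero gate effectively skips the layer
        return self.norm(a + gated)

grn = GatedResidualNetwork(d_model=32, d_context=16)
out = grn(torch.randn(8, 24, 32), torch.randn(8, 24, 16))
```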
The VSN is a key component of the TFT that improves the performance by selecting important features and filtering out noises. It assigns weights to features, which are used to combine the processed inputs:
(16) |
where is the weight corresponding to features; is the flattened vector; and is obtained from the static covariate encoder. The processed features are weighted by their corresponding variable selection weights and then combined.
The TFT model employs a temporal self-attention layer that plays a key role in capturing long-term dependencies in time-series data. This layer not only improves the model ability to understand complex temporal relationships but also enhances the interpretability of forecasts. The self-attention layer used here is a masked and interpretable multi-head attention layer combined with a gating mechanism to selectively control information flow.
The core concept behind the temporal self-attention layer is to calculate the relevance, or “attention”, of different time steps to each other, enabling the TFT model to focus on important events or sequences within the data. This is done using the following equation for attention:
$\mathrm{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = A(\boldsymbol{Q}, \boldsymbol{K})\boldsymbol{V}$ (17)
where V is the value of input based on the similarity between the query vector and key vector ; and is a normalization function that determines the attention weights of value V. The scaled dot-product mechanism for calculating attention is defined as:
$A(\boldsymbol{Q}, \boldsymbol{K}) = \mathrm{softmax}\left(\boldsymbol{Q}\boldsymbol{K}^{\mathrm{T}} / \sqrt{d_{\mathrm{attn}}}\right)$ (18)
where is the dimension of attention layer.
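A minimal implementation of the scaled dot-product attention in (17) and (18) is shown below; the causal mask used in the decoder part of the TFT is omitted for brevity.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention in (17)-(18): softmax(Q K^T / sqrt(d_attn)) V; returns the
    output and the attention weights over time steps."""
    d_attn = q.shape[-1]
    weights = torch.softmax(q @ k.transpose(-2, -1) / d_attn ** 0.5, dim=-1)
    return weights @ v, weights

q = k = v = torch.randn(24, 32)        # 24 time steps, 32-dimensional embeddings
out, attn = scaled_dot_product_attention(q, k, v)
```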
Multi-head self-attention enhances the power of the self-attention mechanism by allowing the model to jointly focus on information from different representation subspaces at different positions. Instead of using a single set of queries, keys, and values, the multi-head self-attention mechanism splits them into multiple sets, each of which is processed independently. Each head computes attention separately, and the results are then concatenated and linearly transformed to produce the final output. By having multiple heads, the TFT model can capture a richer set of relationships and nuances in the data than a single attention head. The multi-head self-attention mechanism is presented as:
$\mathrm{MultiHead}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = [\boldsymbol{H}_{1}, \boldsymbol{H}_{2}, \dots, \boldsymbol{H}_{m_{H}}]\boldsymbol{W}_{H}$ (19)
$\boldsymbol{H}_{h} = \mathrm{Attention}\left(\boldsymbol{Q}\boldsymbol{W}_{Q}^{(h)}, \boldsymbol{K}\boldsymbol{W}_{K}^{(h)}, \boldsymbol{V}\boldsymbol{W}_{V}^{(h)}\right)$ (20)
where , , and are the head-specific weights for queries, keys, and values, respectively, and and are the dimensions of model and weight, respectively; and linearly combines outputs concatenated from all heads (), and mH is the number of heads.
One of the main issues with traditional multi-head attention mechanism is that each head uses different value vectors, making it difficult to directly determine the feature importance from the attention weights. By modifying the mechanism to share the same value vector across all heads, the TFT model can produce a unified set of attention weights, thereby improving interpretability:
$\mathrm{InterpretableMultiHead}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \tilde{\boldsymbol{H}}\boldsymbol{W}_{H}$ (21)
$\tilde{\boldsymbol{H}} = \frac{1}{m_{H}}\sum_{h=1}^{m_{H}} A\left(\boldsymbol{Q}\boldsymbol{W}_{Q}^{(h)}, \boldsymbol{K}\boldsymbol{W}_{K}^{(h)}\right)\boldsymbol{V}\boldsymbol{W}_{V}$ (22)
where is the interpretable multi-head; denotes the final linear mapping used across ; and is the value weight shared across all heads. Compared with in (18), this modification allows each attention head to share the same set of values , resulting in a single and interpretable set of attention scores that can be analyzed to determine feature importance [
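The interpretable multi-head attention in (21) and (22) differs from the standard formulation in (19) and (20) only in that all heads share one value projection and their attention weights are averaged. A PyTorch sketch is given below; the module and attribute names are placeholders.

```python
import torch
import torch.nn as nn

class InterpretableMultiHead(nn.Module):
    """Sketch of (19)-(22): head-specific query/key projections, one value
    projection shared by all heads, and head-averaged attention weights that
    can be inspected directly for interpretability."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.q_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_heads)])
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_heads)])
        self.v_proj = nn.Linear(d_model, d_model)    # shared value weight
        self.out = nn.Linear(d_model, d_model)       # final linear mapping

    def forward(self, q, k, v):
        d = q.shape[-1]
        v_shared = self.v_proj(v)
        weights = [torch.softmax(wq(q) @ wk(k).transpose(-2, -1) / d ** 0.5, dim=-1)
                   for wq, wk in zip(self.q_proj, self.k_proj)]
        mean_weights = torch.stack(weights).mean(dim=0)   # single interpretable weight map
        return self.out(mean_weights @ v_shared), mean_weights

mha = InterpretableMultiHead(d_model=32, n_heads=4)
out, attn = mha(torch.randn(24, 32), torch.randn(24, 32), torch.randn(24, 32))
```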
In this subsection, we describe the SAC algorithm, which is a state-of-the-art maximum-entropy-based off-policy DRL algorithm, to solve the optimization problem of BIES. Typical DRL algorithms generally suffer from limited robustness in real-world applications due to ineffective exploration [
As a DRL algorithm with an actor-critic structure, the SAC algorithm outperforms most algorithms, e.g., DDPG, in convergence performance. The SAC algorithm maximizes both the cumulative reward and the policy entropy. The entropy function is defined in (23), where is the policy conditioned on the state . The state value function and state-action value function are presented in (24) and (25), respectively, where the temperature parameter determines the relative importance of the entropy term against the reward, and thus controls the stochasticity of the optimal policy [
$\mathcal{H}(\pi(\cdot \mid s_{t})) = \mathbb{E}_{a_{t} \sim \pi}\left[-\log \pi(a_{t} \mid s_{t})\right]$ (23)
(24) |
(25) |
At the same time, the state value function can be presented as (26) according to (23) and (24).
$V(s_{t}) = \mathbb{E}_{a_{t} \sim \pi}\left[Q(s_{t}, a_{t}) - \alpha \log \pi(a_{t} \mid s_{t})\right]$ (26)
(27) |
where , guaranteeing that is a valid probabilistic distribution on the action space; and is the normalization function over all actions in the state . When the Q value converges to the optimum, the optimal policy achieves the optimal state value function. Therefore, the updating of the Q-value function can be realized by using the closed-form solution in an off-policy scheme.
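As an illustration of the soft state value in (26), the snippet below draws reparameterized actions from a Gaussian policy head and averages Q minus the entropy term. The network sizes, the temperature value, and all variable names are placeholders rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn

state_dim, action_dim, alpha = 6, 4, 0.2   # placeholder dimensions and temperature

# Toy critic and Gaussian policy head standing in for the actual networks.
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))

def soft_state_value(state, n_samples=32):
    """Monte-Carlo estimate of (26): V(s) = E_{a~pi}[Q(s, a) - alpha*log pi(a|s)]."""
    mean, log_std = policy_net(state).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    actions = dist.rsample((n_samples,))                    # (n_samples, action_dim)
    log_probs = dist.log_prob(actions).sum(dim=-1)          # joint log-density per sample
    states = state.unsqueeze(0).expand(n_samples, -1)
    q_values = q_net(torch.cat([states, actions], dim=-1)).squeeze(-1)
    return (q_values - alpha * log_probs).mean()

v = soft_state_value(torch.randn(state_dim))
```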
The SAC algorithm adopts an actor-critic structure with DNNs to estimate the policy (actor) and Q-value functions (critic). The actor network is represented by the policy function parameterized by . The critic employs clipped double Q networks and and their target networks and . Therefore, the target for the Q value is expressed as (28). Then, the L2 loss in (29) is used to update each Q-network.
$y_{t} = r_{t} + \gamma\left(\min_{j=1,2} Q_{\hat{\phi}_{j}}(s_{t+1}, \tilde{a}_{t+1}) - \alpha \log \pi_{\theta}(\tilde{a}_{t+1} \mid s_{t+1})\right)$ (28)
$L(\phi_{j}) = \frac{1}{|B|}\sum_{(s_{t}, a_{t}, r_{t}, s_{t+1}) \in B}\left(Q_{\phi_{j}}(s_{t}, a_{t}) - y_{t}\right)^{2}$ (29)
where is the action under the current policy in the next state ; is the set of mini batches indexed by ; and is the executed policy.
To train these networks, the agent randomly samples tuples from the experience replay buffer (ERB) to form a mini batch for experience replay learning. The online critic networks are updated by one step of gradient descent on the mean square error (MSE) in (29), while the actor network is updated by one step of gradient ascent using (30). To stabilize the training, the target network parameters are soft updated with (31).
$\nabla_{\theta} \frac{1}{|B|}\sum_{s_{t} \in B}\left(\min_{j=1,2} Q_{\phi_{j}}\left(s_{t}, \tilde{a}_{\theta}(s_{t})\right) - \alpha \log \pi_{\theta}\left(\tilde{a}_{\theta}(s_{t}) \mid s_{t}\right)\right)$ (30)
$\hat{\phi}_{j} \leftarrow \tau_{\mathrm{soft}} \phi_{j} + \left(1-\tau_{\mathrm{soft}}\right) \hat{\phi}_{j}$ (31)
where is a sample from ; and is the soft update parameter.
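Putting (28)-(31) together, a single SAC update on a sampled mini batch can be sketched as follows. The function assumes externally defined actor and critic networks (the actor exposing a sample() method returning actions and log-probabilities) and uses placeholder hyperparameter values.

```python
import torch

def sac_update(batch, actor, q1, q2, q1_targ, q2_targ, q_optim, actor_optim,
               alpha=0.2, gamma=0.99, tau=0.005):
    """One SAC update on a replay-buffer mini batch, following (28)-(31).
    The actor is assumed to expose sample(s) -> (action, log_prob)."""
    s, a, r, s_next = batch

    # (28): clipped double-Q target with the entropy bonus.
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * (q_next - alpha * logp_next)

    # (29): mean squared error of both critics against the target.
    q_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    q_optim.zero_grad(); q_loss.backward(); q_optim.step()

    # (30): entropy-regularized policy objective (gradient ascent on Q - alpha*log pi).
    a_new, logp_new = actor.sample(s)
    actor_loss = (alpha * logp_new - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    actor_optim.zero_grad(); actor_loss.backward(); actor_optim.step()

    # (31): soft update of the target critics.
    with torch.no_grad():
        for targ, online in ((q1_targ, q1), (q2_targ, q2)):
            for p_t, p in zip(targ.parameters(), online.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)
```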
The use of the proposed TFT-SAC approach is unique and effective for the dynamic operation and control of BIES. This combination offers several advantages but also has potential shortcomings compared with other traditional approaches.
1) Integrated forecasting and operation: the TFT provides accurate and data-driven forecasts of PV generation and energy demand, which allows the SAC algorithm to make informed decisions. This integration reduces uncertainty in the decision-making process, leading to more reliable system operations.
2) Offline training and efficient online operation: the proposed TFT-SAC approach allows for offline training using historical data, enabling the development of a robust policy before deployment. Once trained, the algorithm operates in real time with minimal computational overhead, which is a significant advantage over approaches like SP or RO that require repeated re-optimization.
3) Handling non-convexity: the operation of BIES involves non-convex constraints such as the FOR. The SAC algorithm, leveraging DNNs, can effectively learn non-convex optimal operating policies due to the powerful representation capabilities of DNNs. In comparison, traditional mathematical programming approaches such as mixed-integer linear programming (MILP) address non-convexity by linearizing nonlinear relationships and explicitly formulating integer constraints, facing scalability and computational challenges particularly in large and dynamic systems like BIES. Heuristic algorithms can explore complex optimization landscapes and are often more flexible than mathematical programming. However, they may suffer from high computational demands, especially in large-scale systems, and may converge to local optima rather than finding the global solution.
4) Training complexity: the proposed TFT-SAC approach requires extensive offline training, which can be computationally expensive and time-consuming, particularly for large datasets. The performance relies heavily on a high-quality training dataset, which is typically hard to acquire in the real world.
5) Dependence on forecasting accuracy: the effectiveness of SAC algorithm in making optimal decisions depends heavily on the forecasting accuracy provided by TFT. If the forecasts are inaccurate due to unexpected external factors, the quality of the operational decisions may be compromised.
Overall, the proposed TFT-SAC approach provides an effective solution for BIES operation. The integrated forecasting and optimization structure, the capability to handle non-convexity, and the efficient implementation make this approach a compelling alternative to traditional approaches, despite some challenges related to training complexity and dependence on forecasting accuracy.
To validate the effectiveness of the proposed TFT-SAC approach, we conduct case studies using data from a real building located in Zhenjiang, China. The BIES under study comprises devices like a micro-CHP unit, PV panels, BESSs, and GBs to meet both heat and power demands.
The micro-CHP unit, with a rated output of 25.3 kW, is designed to satisfy the heat demand of the building while partially covering its power demand. The PV system includes 610 PV panels, each with a capacity of 280 W, resulting in a theoretical maximum output of 170.8 kW. However, due to practical limitations, the actual capacity is 153 kW. The BESS consists of 24 LiFePO4 batteries, each with a storage capacity of 5.12 kWh, providing a maximum output of 72 kW. This setup enables the BESS to support peak power demand for up to 4 hours. Detailed information of micro-CHP unit and BESS is shown in Supplementary Material A.
The proposed TFT-SAC approach is implemented in Python, and the neural networks are developed using PyTorch [
Neural network | Number of hidden layers | Number of neurons | Learning rate | Soft update parameter | Optimizer
---|---|---|---|---|---
Actor | 3 | [512, 32] | 1×1 | 11×1 | Adam
Critic | 2 | [512, 32] | 1×1 | 11×1 | Adam
Training parameter | Value
---|---
Replay buffer size | 1×1
Replay start size | 128 |
Batch size | 128 |
Discount factor | 0.99 |
Parameter | Forecast of energy demand | Forecast of PV generation |
---|---|---|
Learning rate | 1×1 | 3.5×1
Grad clip value | 0.1 | 0.9 |
Patience | 10 | 2 |
Batch size | 16 | 16 |
Drop out | 0.2 | 0.1 |
Time step | 168 | 24 |
Hidden size | 128 | 32 |
Number of LSTM layers | 6 | 4 |
Number of attention heads | 6 | 3 |
Loss function | Quantile loss | Quantile loss |
This subsection compares the SAC algorithm with baseline algorithms TD3 and DDPG. Each algorithm is trained for 10000 episodes on sampled days from the training set.

Fig. 4 Episodic reward evolution of different algorithms during offline training process.
To evaluate the performance of the proposed TFT-SAC approach, we use the trained actor network parameters to generate operational strategies for the BIES over 50 test days. We compare this forecasting-combined RL approach with benchmark approaches: typical RL approaches (TD3, DDPG, and SAC) and another forecasting-combined RL approach (LSTM-SAC).

Fig. 5 Cumulative cost for energy consumption with different approaches over 50 test days.
As can be seen from
Forecast object | Model | MAE | RMSE | R² |
---|---|---|---|---|
PV generation | LSTM | 3.66 | 12.23 | 0.8402
PV generation | TFT | 5.22 | 11.24 | 0.8721
Energy demand | LSTM | 3.37 | 4.60 | 0.9407
Energy demand | TFT | 2.20 | 3.26 | 0.9670
Figures

Fig. 6 Performance of LSTM and TFT models in PV generation forecasting.

Fig. 7 Performance of LSTM and TFT models in energy demand forecasting.
The meteorological data include net solar irradiation (NSI), solar irradiation (SI), ultraviolet (UV), outdoor air temperature (OAT), rainfall (RF), relative humidity (RH), temperature-humidity-wind (THW), and surface air temperature (SAT).

Fig. 8 Relative importance of different features in TFT model for forecasting PV generation. (a) Encoder. (b) Decoder.

Fig. 9 Relative importance of different features in TFT model for forecasting energy demand. (a) Encoder. (b) Decoder.
The importance ranking reveals that the TFT model considers both weather conditions and temporal attributes to accurately forecast energy demands. This is crucial because user activities are often influenced by the time of day or specific events on the calendar, and these behavioral patterns significantly affect energy usage in buildings. The model's attention to these aspects shows its ability to learn from diverse data sources and to focus on the most impactful features during the training process, resulting in a more reliable forecast.
Figures

Fig. 10 Attention of TFT model over past 7 days for forecasting PV generation.

Fig. 11 Attention of TFT model over past 7 days for forecasting energy demand.
In comparison, the TFT model for PV generation forecasting focuses on recent time steps due to daily cyclic patterns, while that for forecasting energy demands has a broad attention span over the entire historical cycle, balancing long-term trends and short-term impacts. The gradual increase in attention weights indicates the emphasis on recent information for imminent forecasts.
The uniform attention distribution for energy demand suggests that its cyclical patterns are less pronounced or more complex than those of PV generation. This highlights the importance of extracting information from multiple time scales for accurate forecasts and underscores the need for effective energy management strategies to optimize BIES operational efficiency.
In summary, the TFT model provides accurate and interpretable forecasts for both PV generation and energy demand, supporting the RL algorithm in formulating efficient scheduling strategies.
To validate the generalization performance, different approaches are tested over a test set that shows different statistical characteristics compared with the training set. The test set is represented by several typical weeks labeled W-1 to W-4 for comparative analysis. These typical weeks include scenarios with extreme PV generation or energy demand.
Daily average operational cost (¥):

Week | DDPG | TD3 | SAC | LSTM-SAC | TFT-SAC
---|---|---|---|---|---
W-1 | 500.14 | 499.30 | 490.19 | 328.02 | 325.79 |
W-2 | 361.75 | 361.20 | 347.92 | 232.76 | 231.60 |
W-3 | 450.34 | 449.66 | 431.40 | 318.91 | 311.03 |
W-4 | 733.25 | 732.44 | 715.75 | 521.30 | 520.99 |
To compare the robustness of the proposed TFT-SAC approach with other RL approaches, we introduce independent Gaussian noises to real PV generation and energy demand to represent uncertain scenarios. The average daily operational costs of BIES at different noise levels are presented in
Daily average operational cost (¥):

Noise level | DDPG | TD3 | SAC | LSTM-SAC | TFT-SAC
---|---|---|---|---|---
0.01 | 596.07 | 557.56 | 557.49 | 505.12 | 490.04 |
0.02 | 596.38 | 558.24 | 558.18 | 505.82 | 491.88 |
0.03 | 597.37 | 559.02 | 558.96 | 506.62 | 494.91 |
0.04 | 599.80 | 559.85 | 559.78 | 507.47 | 495.13 |
0.05 | 603.91 | 560.73 | 560.66 | 508.38 | 495.17 |
Across all noise levels, the typical RL approaches incur significantly higher operational costs than the forecasting-combined RL approaches, with cost differences ranging from ¥60 to ¥100. Among all the tested approaches, the proposed TFT-SAC approach demonstrates the lowest average operational costs, indicating superior robustness. The daily cost gap between the proposed TFT-SAC approach and LSTM-SAC remains moderate, at roughly ¥12-¥15 across the noise levels. In contrast, as the noise level rises from 0.01 to 0.05, the cost of the proposed TFT-SAC approach increases by approximately ¥5, whereas those of TD3, SAC, and LSTM-SAC increase by only about ¥3. This larger cost variation suggests that the proposed TFT-SAC approach is more sensitive to forecasting accuracy than the other approaches, even though it consistently achieves the lowest average operational costs among all approaches.
To evaluate the generalization of the optimal energy management policy learned by the proposed TFT-SAC approach, we apply two typical scenarios: a summer day (August 27) and a winter day (December 25). Figures

Fig. 12 Power generation and consumption of BIES. (a) A summer day. (b) A winter day.

Fig. 13 Heat generation and consumption of BIES. (a) A summer day. (b) A winter day.
Both scenarios share common trends. Initially, from 00:00 to 08:00, the BIES purchases electricity due to zero PV generation and a low SoC of the BESS. The BESS charges at low prices for future demands. From 09:00 to 15:00, PV generation and BESS discharging can meet most of the power demand, with excess power sold at high electricity prices. From 18:00 to 24:00, the BIES does not sell electricity, and the micro-CHP unit becomes the primary power source due to high demand.
Nevertheless, there are some evident differences between the two typical days. On the winter day, the micro-CHP unit operates from 09:00 to 15:00 to meet high heat demands and support the power demands due to low PV generation. On the summer day, the micro-CHP unit is inactive as PV and BESS can meet the demands and the excess power is sold. The policy effectively uses micro-CHP unit in winter and BESS in summer, charging at low prices and discharging at high prices to maximize the economic benefits.
Finally, it can be concluded that the proposed TFT-SAC approach can learn an effective policy and can generalize to variable state information on different test days. Also, the flexibility of the BIES is investigated on the two typical winter and summer days. Specifically, the summer day has higher PV generation and lower heat demand, so it exports more energy and exploits more of the flexibility of the BIES. Due to the lower PV generation and higher heat demand, the winter day has a higher power import and a higher utilization of the micro-CHP unit, which also provides substantial flexibility to the BIES.
In this subsection, a detailed sensitivity analysis is conducted to evaluate the impact of changes in key factors on the operation and performance of BIES. Specifically, we analyze the sensitivities of the episodic reward to variations in electricity price, PV generation, power demand, and heat demand, as shown in

Fig. 14 Sensitivity analysis on several factors.
The sensitivity analysis is performed by varying each parameter independently from 90% to 110% of the initial configured value, with a granularity of 5%. This range is selected to represent potential fluctuations in market and operational conditions, and the granularity is chosen to provide a balanced level of detail without excessive computational overhead.
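The sweep described above can be reproduced with a short script such as the one below, where evaluate_policy() is a placeholder standing in for a rollout of the trained TFT-SAC policy over the test days with the selected factor rescaled.

```python
import numpy as np

def evaluate_policy(factor, scale):
    """Placeholder: roll out the trained TFT-SAC policy on the test days with
    the named input profile multiplied by `scale` and return the episodic
    reward; a dummy value is returned here so the sweep is runnable."""
    rng = np.random.default_rng(abs(hash((factor, round(scale, 2)))) % 2**32)
    return float(rng.normal(-500.0, 10.0))

# Scale each factor from 90% to 110% of its nominal profile in 5% steps.
scales = np.round(np.arange(0.90, 1.101, 0.05), 2)
factors = ["electricity_price", "pv_generation", "power_demand", "heat_demand"]
sensitivity = {f: [evaluate_policy(f, s) for s in scales] for f in factors}
```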
The results in
Interestingly, the power demand has a greater effect on the episodic reward compared with PV generation. This is because the total daily PV generation is lower than the total power demand. As a result, any decrease in power demand has a larger marginal impact on profitability, either through reduced procurement or allowing more energy to be sold during peak periods.
In terms of scheduling policies, the changes in power demand and PV generation lead to noticeable shifts in action prioritization. For instance, the increased PV generation results in more frequent utilization of BESS for energy arbitrage, while fluctuations in electricity price affect decisions regarding energy procurement timing. These findings emphasize the importance of accurate forecasts for PV generation and energy demand in effectively optimizing the operational strategies of BIES.
In conclusion, this paper develops a novel hybrid data-driven approach, i.e., the TFT-SAC approach, for optimal scheduling in BIES. Specifically, the TFT model improves the forecasting accuracy and transparency through the attention mechanism and the VSN, making the forecasting results interpretable and trustworthy. The integration of the SAC algorithm for optimization further strengthens this framework by ensuring more effective exploration during training, leading to stronger robustness and generalization capabilities. Simulation results demonstrate the superior performance of the proposed TFT-SAC approach compared with the existing approaches. The interpretability of the TFT model and the generalization performance of the SAC algorithm are analyzed. A sensitivity analysis of the reward with respect to several key factors in the BIES is also conducted.
References
X. Cao, X. Dai, and J. Liu, “Building energy-consumption status worldwide and the state-of-the-art technologies for zero-energy buildings during the past decade,” Energy and Buildings, vol. 128, pp. 198-213, Sept. 2016.
W. Wu, P. Li, B. Wang et al., “Integrated distribution management system: architecture, functions, and application in China,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 2, pp. 245-258, Mar. 2022.
H. Qiu, V. Veerasamy, C. Ning et al., “Two-stage robust optimization for assessment of PV hosting capacity based on decision-dependent uncertainty,” Journal of Modern Power Systems and Clean Energy, vol. 12, no. 6, pp. 2091-2096, Nov. 2024.
X. Huang, Z. Xu, Y. Sun et al., “Heat and power load dispatching considering energy storage of district heating system and electric boilers,” Journal of Modern Power Systems and Clean Energy, vol. 6, no. 5, pp. 992-1003, Nov. 2018.
C. Huang, H. Zhang, L. Wang et al., “Mixed deep reinforcement learning considering discrete-continuous hybrid action space for smart home energy management,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 3, pp. 743-754, May 2022.
H. Zhao, B. Wang, X. Wang et al., “Active dynamic aggregation model for distributed integrated energy system as virtual power plant,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 5, pp. 831-840, Sept. 2020.
M. Sechilariu, B. Wang, and F. Locment, “Building integrated photovoltaic system with energy storage and smart grid communication,” IEEE Transactions on Industrial Electronics, vol. 60, no. 4, pp. 1607-1618, Apr. 2013.
Y. Li, C. Wang, G. Li et al., “Improving operational flexibility of integrated energy system with uncertain renewable generations considering thermal inertia of buildings,” Energy Conversion and Management, vol. 207, p. 112526, Mar. 2020.
R. Jing, M. Wang, Z. Zhang et al., “Comparative study of posteriori decision-making methods when designing building integrated energy systems with multi-objectives,” Energy and Buildings, vol. 194, pp. 123-139, Jul. 2019.
Y. Zhang, P. E. Campana, A. Lundblad et al., “Planning and operation of an integrated energy system in a Swedish building,” Energy Conversion and Management, vol. 199, p. 111920, Nov. 2019.
Z. Zhu, Z. Hu, K. W. Chan et al., “Reinforcement learning in deregulated energy market: a comprehensive review,” Applied Energy, vol. 329, p. 120212, Jan. 2023.
A. Dolatabadi, H. Abdeltawab, and Y. A. I. Mohamed, “A novel model-free deep reinforcement learning framework for energy management of a PV integrated energy hub,” IEEE Transactions on Power Systems, vol. 38, no. 5, pp. 4840-4852, Sept. 2023.
D. Qiu, Z. Dong, X. Zhang et al., “Safe reinforcement learning for real-time automatic control in a smart energy-hub,” Applied Energy, vol. 309, p. 118403, Mar. 2022.
Z. Zhu, K. W. Chan, S. Xia et al., “Optimal bi-level bidding and dispatching strategy between active distribution network and virtual alliances using distributed robust multi-agent deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 13, no. 4, pp. 2833-2843, Jul. 2022.
Y. Zhou, Z. Ma, J. Zhang et al., “Data-driven stochastic energy management of multi energy system using deep reinforcement learning,” Energy, vol. 261, p. 125187, Dec. 2022.
Z. Hu, K. W. Chan, Z. Zhu et al., “Techno-economic modeling and safe operational optimization of multi-network constrained integrated community energy systems,” Advances in Applied Energy, vol. 15, p. 100183, Sept. 2024.
Y. Zhou, B. Zhang, C. Xu et al., “A data-driven method for fast AC optimal power flow solutions via deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1128-1139, Nov. 2020.
D. Cao, W. Hu, X. Xu et al., “Deep reinforcement learning based approach for optimal power flow of distribution networks embedded with renewable energy and storage devices,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 5, pp. 1101-1110, Sept. 2021.
Q. Ma and C. Deng, “Simplified deep reinforcement learning based volt-var control of topologically variable power system,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 5, pp. 1396-1404, Sept. 2023.
Y. Wang, M. Mao, L. Chang et al., “Intelligent voltage control method in active distribution networks based on averaged weighted double deep Q-network algorithm,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 1, pp. 132-143, Jan. 2023.
B. Lim, S. Ö. Arık, N. Loeff et al., “Temporal fusion transformers for interpretable multi-horizon time series forecasting,” International Journal of Forecasting, vol. 37, no. 4, pp. 1748-1764, Oct. 2021.
W. J. von Eschenbach, “Transparency and the black box problem: why we do not trust AI,” Philosophy & Technology, vol. 34, no. 4, pp. 1607-1622, Sept. 2021.
T. M. Alabi, L. Lu, and Z. Yang, “Data-driven optimal scheduling of multi-energy system virtual power plant (MEVPP) incorporating carbon capture system (CCS), electric vehicle flexibility, and clean energy marketer (CEM) strategy,” Applied Energy, vol. 314, p. 118997, May 2022.
S. Zhou, D. He, Z. Zhang et al., “A data-driven scheduling approach for hydrogen penetrated energy system using LSTM network,” Sustainability, vol. 11, no. 23, p. 6784, Dec. 2019.
A. Kämper, R. Delorme, L. Leenders et al., “Boosting operational optimization of multi-energy systems by artificial neural nets,” Computers & Chemical Engineering, vol. 173, p. 108208, May 2023.
Y. Xu, W. Gao, Y. Li et al., “Operational optimization for the grid-connected residential photovoltaic-battery system using model-based reinforcement learning,” Journal of Building Engineering, vol. 73, p. 106774, Aug. 2023.
G. Pan, W. Gu, Y. Lu et al., “Optimal planning for electricity-hydrogen integrated energy system considering power to hydrogen and heat and seasonal storage,” IEEE Transactions on Sustainable Energy, vol. 11, no. 4, pp. 2662-2676, Oct. 2020.
R. Wen, K. Torkkola, B. Narayanaswamy et al. (2017, Nov.). A multi-horizon quantile recurrent forecaster. [Online]. Available: https://arxiv.org/abs/1711.11053
A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, pp. 1-10, Aug. 2017.
T. Haarnoja, A. Zhou, K. Hartikainen et al. (2018, Jan.). Soft actor-critic algorithms and applications. [Online]. Available: https://arxiv.org/abs/1812.05905
A. Paszke, S. Gross, F. Massa et al., “PyTorch: an imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, pp. 1-12, Dec. 2019.