Journal of Modern Power Systems and Clean Energy

ISSN 2196-5625 CN 32-1884/TK


Probabilistic Residential Load Forecasting with Sequence-to-sequence Adversarial Domain Adaptation Networks

  • Hanjiang Dong 1,2 (Graduate Student Member, IEEE)
  • Jizhong Zhu 1 (Fellow, IEEE)
  • Shenglin Li 1
  • Yuwang Miao 1
  • Chi Yung Chung 2 (Fellow, IEEE)
  • Ziyu Chen 1
1. School of Electric Power Engineering, South China University of Technology, Guangzhou, China; 2. Department of Electrical Engineering, The Hong Kong Polytechnic University, Hong Kong, China

Updated: 2024-09-24

DOI:10.35833/MPCE.2023.000841


Abstract

Lately, the power demand of consumers in distribution networks is increasing, while renewable power generation keeps penetrating into these networks. Insufficient data make it hard to accurately predict new residential loads, e.g., those of newly built apartments, whose time-series characteristics are volatile and changing in frequency and magnitude. Hence, this paper proposes a short-term probabilistic residential load forecasting scheme based on transfer learning and deep learning techniques. First, we formulate the short-term probabilistic residential load forecasting problem. Then, we propose a sequence-to-sequence (Seq2Seq) adversarial domain adaptation network and its joint training strategy to transfer generic features from the source domain (with massive consumption records of regular loads) to the target domain (with limited observations of new residential loads) while simultaneously minimizing the domain difference and the forecasting errors. For implementation, dominant techniques are used as the submodules of the Seq2Seq adversarial domain adaptation network, including the Seq2Seq recurrent neural network (RNN) composed of a long short-term memory (LSTM) encoder and an LSTM decoder, and the quantile loss. Finally, this study conducts case studies via multiple evaluation indices, comparative methods from classic machine learning and advanced deep learning, and various levels of data availability for the new residential loads and other regular loads. The experimental results validate the effectiveness and stability of the proposed scheme.

I. Introduction

SHORT-TERM load forecasting of power demand is one of the main research areas in electrical engineering [1]. Load forecasts are crucial to the planning and operation of power systems, particularly in the context of developing electricity markets and promoting renewable energy. When integrating distributed renewables, generating accurate forecasts is an essential part of the energy management of residential households [2] conducted by regional operators. For instance, renewable power generation such as photovoltaics may supply part of the household power demand. Accurate load forecasts are important for operators in two respects. First, the capacity of distributed solar power can occupy over 45% of the total capacity [3]. Second, residential prosumers generate around two-thirds of the growth in distributed solar power [4].

Numerous techniques have been applied to generate forecasts, which can be grouped into classic statistical techniques (e.g., stochastic time-series models [5]), machine learning (e.g., support vector regression [6]), and deep learning techniques (e.g., long short-term memory (LSTM) recurrent neural networks (RNNs) [7]). Identifying an adequate model is the key to producing accurate forecasts and is the core link of the forecasting procedure [8]. Deep learning, which allows computational models consisting of multiple processing layers to learn representations of data with multiple levels of abstraction, has improved the state of the art in financial indices, weather indices, energy demand, and other areas, in addition to achieving breakthroughs in computer vision, natural language processing, and speech recognition [9]. The success of deep learning relies on automatically discovering intricate structures via neural networks (NNs), with high-dimensional features in each hidden layer, in modularized, data-driven, and end-to-end manners, instead of manually engineering features for a given dataset or given data sources [10].

Lately, the community has paid attention to residential load forecasting and probabilistic forecasting problems. On the one hand, the residential load is uncertain and random, depending on varying individual behavior patterns, regional layouts, and complicated factors such as weather conditions, and thus exhibits random time-series characteristics [11]-[14]. Reference [11] considers the practice theory of human behaviors to present a sampling model based on the Markov chain to predict aggregated residential loads in a bottom-up style. References [12] and [14] use LSTM RNNs to process residential behaviors through whole-house consumption and selected appliances to generate short-term forecasts. Reference [13] disaggregates the total load of aggregated households into partial loads monitored by smart meters and others predicted via NNs. In summary, data-driven methods, particularly LSTM RNNs, have exhibited superior generalization capability in generating residential load forecasts.

Furthermore, sequence-to-sequence (Seq2Seq) RNNs can describe the volatile temporal characteristics of residential consumption records in distribution networks [15]. Reference [16] explores Seq2Seq RNNs for modeling unpredictable supply-demand imbalances caused by the variable nature of distributed renewable power generation in commercial or office buildings. Reference [17] attempts to predict the peak residential demand supplied by distribution network operators, who handle energy policy and deploy demand response programs, by designing bidirectional LSTM (B-LSTM) Seq2Seq RNNs. As deep learning methods make the energy industry more reliable and sustainable, [18] confirms the feasibility of Seq2Seq RNNs for producing multi-step load forecasts for single households. Hence, Seq2Seq RNNs serve as a promising solution for processing input sequences (e.g., influential factors) and output sequences (e.g., multi-step targets). Investigations of Seq2Seq RNNs in residential load forecasting are therefore imperative.

On the other hand, probabilistic load forecasting solutions that can describe the uncertainty of predictions are surging [19]-[21]. Specifically, [20] proposes a convolutional neural network (CNN) with squeeze-and-excitation modules to handle micrometeorological records and obtain day-ahead probabilistic load forecasts for residents. Reference [21] proposes a modified memristive LSTM RNN that characterizes variable-wise features and temporal importance via a mixture of attention-based techniques to interpret the predictions produced by the model. In summary, probabilistic load forecasting solutions are increasingly significant for households, particularly when integrating stochastic and intermittent renewables. However, the exploration of adaptive Seq2Seq RNNs for flexible probabilistic forecasts is lacking, illustrating the research gap in this topic.

Moreover, current studies seldom notice the limited applicability of the powerful Seq2Seq RNNs to forecasting schemes for new residential loads that generally lack sufficient data records. In other words, insufficient data make it difficult to generate accurate forecasts for a new residential load. To our knowledge, there remains no technical work focusing on new residential loads from real-life considerations, which motivates us to propose a probabilistic residential load forecasting scheme using limited data based on the Seq2Seq RNN. The domain adaptation network is well known in computer vision and other artificial intelligence fields, but its application to residential load forecasting has received less attention than expected. Reference [22] presents an ensemble model, where updated data can be utilized to adjust the weights by a sample domain adaptation method called TrAdaBoost. Reference [23] improves the adversarial domain adaptation method through an initial state fusion strategy that analyzes adversarial disequilibrium and an information entropy index that quantifies domain similarity. In this context, we recognize the realistic condition where sufficient data of regular loads cannot be directly used by deep learning methods to generate forecasts for the new residential load due to distribution shift. In other words, the limited data of new residential loads are not enough for leveraging the power of deep learning methods, and deep learning methods should not be applied directly, given the assumption that the training data are independent and identically distributed and the risk of overfitting.

Hence, this paper proposes a short-term probabilistic residential load forecasting scheme. The aim of this study is to solve the load forecasting problem for new residential loads, where many other regular loads have been connected to the power system for a relatively long time. The distribution of the energy demand of the new residential loads, which is difficult to approximate with insufficient or limited data, can differ from that of other regular loads with relatively sufficient data. The innovation of the proposed scheme is the Seq2Seq adversarial domain adaptation network that aligns the distributions of the sufficient data of regular loads and the limited data of new residential loads, thus reasonably meeting the assumption that the training data are independent and identically distributed before leveraging the potential power of deep learning methods to improve the forecasts, which is necessary but usually ignored. Thus, the proposed scheme can process sufficient data of regular loads to maintain model performance, even though the data of the new residential loads are limited.

The contributions of this paper are as follows.

1) A Seq2Seq adversarial domain adaptation network and its joint training strategy are proposed to align the distribution of sufficient data of regular loads and the limited data of the new residential loads.

2) An LSTM cell based, teacher-forced Seq2Seq RNN with quantile loss is adopted as the submodule of the Seq2Seq adversarial domain adaptation network to capture the temporal dependencies between regular loads and the new residential loads.

3) The proposed Seq2Seq adversarial domain adaptation network is comprehensively compared with other classic methods in different horizons, criteria, and scarcity degrees of energy consumption observations.

The remainder of this paper is organized as follows. Section II formulates the short-term probabilistic residential load forecasting problem. Section III proposes the short-term probabilistic residential load forecasting scheme. Section IV performs a comprehensive case study to validate the proposed scheme using multiple evaluation indices, comparative techniques, and degrees of data scarcity. Section V draws the conclusions.

II. Formulation of Short-term Probabilistic Residential Load Forecasting Problem

This section formulates the short-term probabilistic residential load forecasting problem, starting from the short-term deterministic load forecasting problem and the conventional multi-step (e.g., 24-point day-ahead) load forecasting problem.

A. Short-term Deterministic Load Forecasting Problem

Given the dataset $\mathcal{O}_t^{\mathrm{clean}}$ at time $t$, we often process the chronological observations by identifying adequate models and optimizing the model assignments. Denote the current time as $t_c$, the forecasting gap time as $t_g$, the forecasting lead time (i.e., horizon) as $t_h$, and the number of horizons as $N_{\mathrm{hor}}$. The record at a certain time, $o_t^{\mathrm{clean}}$, covers elements that can be classified into the static covariate vector $o_t^{\mathrm{static}}$, the historical vector $o_t^{\mathrm{past}}$, and the future vector $o_t^{\mathrm{future}}$, where $t=1,2,\dots,t_c,\dots,t_c+t_h,\dots,t_c+N_{\mathrm{hor}}t_h$. The static covariate vector $o_t^{\mathrm{static}}$ covers temporally static factors such as the load location and the identity number of residents. The historical vector $o_t^{\mathrm{past}}=\{o_{t-t_w:t}^{\mathrm{obs}},o_{t-t_w:t}^{\mathrm{known}}\}$ involves the information available from the historical time, i.e., the observed factors $o_{t-t_w:t}^{\mathrm{obs}}$ such as target loads and other public factors $o_{t-t_w:t}^{\mathrm{known}}$, where $t_w$ denotes the span of past records for each horizon. The future vector $o_t^{\mathrm{future}}=\{o_{t:t+t_g+t_h}^{\mathrm{target}},o_{t:t+t_g+t_h}^{\mathrm{known}}\}$ involves the information for future moments, i.e., the target vector $o_{t:t+t_g+t_h}^{\mathrm{target}}$ and other pre-known factors $o_{t:t+t_g+t_h}^{\mathrm{known}}$ such as calendar rules. In regression, the target vector $o_{t:t+t_g+t_h}^{\mathrm{target}}$ is the dependent variable $y_t$, and the input variable $x_t$ gathers the historical, future, and static covariate vectors, i.e., $f\colon x_t\to y_t$ with $x_t=\{o_{t-t_w:t}^{\mathrm{obs}},o_{t-t_w:t+t_g+t_h}^{\mathrm{known}},o_t^{\mathrm{static}}\}$ and $y_t=o_{t:t+t_g+t_h}^{\mathrm{target}}$, as expressed in (1).

$$\mathcal{M}_i:=\big\{f\colon x_t\to y_t \,\big|\, y_t=f(x_t;\theta_i),\ \theta_i\in\mathbb{R}^{N_{\mathrm{par}}}\big\}\tag{1}$$

where $f\colon x_t\to y_t$ reflects the mapping between the input $x_t$ and the output $y_t$; $\mathcal{M}_i$ is the assignment space of model $i$, $i=1,2,\dots,N_i$, and $N_i$ is the number of candidate models; $\theta_i$ is the parameter vector or matrix of model $i$; and $N_{\mathrm{par}}$ is the number of parameters.

Given the dataset $D$ of influential factors and target variables, the experts estimate the optimal parameter assignments $\theta_i^*$ for the alternative models under the optimal hyperparameters $\lambda_i^*$, which are tuned by trial and error with the training data $D^{\mathrm{est}}=\{x_t^{\mathrm{est}},y_t^{\mathrm{est}}\}$, as expressed in (2). Let $f_i$ be the mapping described by model $i$. Experts then compare the performance of the alternative models to identify the adequate model to produce predictions $\hat{y}_t^{\mathrm{est}}$ over the testing inputs $x_t^{\mathrm{est}}$ (a.k.a. features), assuming that the true testing targets $y_t^{\mathrm{est}}$ remain unknown before deployment, as shown in (3).

$$\theta_i^*=\arg\min_{\theta_i\in\Theta_i}R_{\mathrm{emp}}(\theta_i)\approx\arg\min_{\theta_i\in\Theta_i}\mathbb{E}_{D^{\mathrm{est}}\overset{\mathrm{i.i.d.}}{\sim}\mathscr{D}}\big[\mathcal{L}\big(f_i(\theta_i),D^{\mathrm{est}}\big)\,\big|\,\lambda_i^*\big]\tag{2}$$
$$f^*=\arg\min_{f\in\{f_i^*\}_{i=1}^{N_i}}\mathbb{E}_{D^{\mathrm{est}}\overset{\mathrm{i.i.d.}}{\sim}\mathscr{D}}\big[\mathcal{C}\big(f_i,D^{\mathrm{est}}\big)\,\big|\,\theta_i^*,\lambda_i^*\big]\tag{3}$$

where $\Theta_i$ is the set of $\theta_i$; $R_{\mathrm{emp}}(\theta_i)$ is the empirical loss; $\mathbb{E}_{D^{\mathrm{est}}\overset{\mathrm{i.i.d.}}{\sim}\mathscr{D}}[\cdot]$ is the expectation of the loss function $\mathcal{L}(\cdot)$ on the independent and identically distributed training data $D^{\mathrm{est}}$ under an unknown distribution $\mathscr{D}$; and $\mathcal{C}(\cdot)$ is a cost function to evaluate the forecasting accuracy.

B. Multi-step Load Forecasting Problem

When the lead time is a single time step, the model generates a single scalar (i.e., point) at each time, and the load forecasting problem is defined as the single-step load forecasting problem in (4). When the lead time covers more than one observed moment, e.g., the whole next day or week, the model outputs a profile composed of multiple points at a time in the multi-step load forecasting problem. There are three major solutions to the multi-step load forecasting problem (here, the day-ahead forecasting problem): ① combining multiple single-step load forecasting models, each of which produces forecasts independently, as depicted in (5); ② using one multi-step load forecasting model to generate the interval-ahead curve once and for all, as given in (6), which is adopted by the deep learning methods in this study; and ③ replacing the inputs with the last predicted values to generate the forecasts iteratively, as given in (7). A minimal sketch of the three strategies follows the equations.

$$\mathcal{M}_i:=\big\{f\colon x_t\to y_t \,\big|\, y_t=f(x_t;\theta_i),\ \theta_i\in\mathbb{R}^{N_{\mathrm{par}}}\big\}\tag{4}$$
$$\mathcal{M}_{i,h}:=\big\{f\colon x_{t,h}\to y_{t,h} \,\big|\, y_{t,h}=f_i(x_{t,h};\theta_{i,h})\big\}\tag{5}$$
$$\mathcal{M}_i:=\big\{f\colon x_t\to y_t \,\big|\, y_t=f(x_t;\theta_i)\big\}\tag{6}$$
$$\mathcal{M}_i:=\big\{f\colon x_{t,h}\to y_{t,h} \,\big|\, y_{t,h}=f(x_{t,h};\theta_i)\big\}\tag{7}$$

where $h$ is the index of the horizon, $h=1,2,\dots,N_{\mathrm{hor}}$.
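For concreteness, the following sketch contrasts the three strategies in (5)-(7) on a synthetic hourly series; the gradient boosting regressor, window length, and horizon are illustrative assumptions rather than the models used later in this paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
T, lags, H = 2000, 24, 24                       # series length, input window, horizon
y = np.sin(np.arange(T) * 2 * np.pi / 24) + 0.1 * rng.standard_normal(T)

# Supervised samples: x_t = last `lags` values, target = next H values.
X = np.array([y[t - lags:t] for t in range(lags, T - H)])
Y = np.array([y[t:t + H] for t in range(lags, T - H)])

# Strategy 1, cf. (5): one independent single-step model per horizon h.
direct = [GradientBoostingRegressor().fit(X, Y[:, h]) for h in range(H)]
pred_direct = np.array([m.predict(X[-1:])[0] for m in direct])

# Strategy 2, cf. (6): one multi-output model emits the whole curve at once.
mimo = MultiOutputRegressor(GradientBoostingRegressor()).fit(X, Y)
pred_mimo = mimo.predict(X[-1:])[0]

# Strategy 3, cf. (7): a single-step model fed back with its own forecasts.
single = GradientBoostingRegressor().fit(X, Y[:, 0])
window, pred_rec = X[-1].copy(), []
for _ in range(H):
    nxt = single.predict(window.reshape(1, -1))[0]
    pred_rec.append(nxt)
    window = np.roll(window, -1)                # slide the window one step
    window[-1] = nxt                            # append the latest forecast
```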

C. Probabilistic Load Forecasting Problem

The single-step load forecasting problem is defined as a deterministic forecasting problem when the model generates a single deterministic value at each time, as given in (4). Deterministic forecasts are the most common form but cannot portray the uncertainty of the predictions. Thus, uncertain predictions in the forms of intervals, quantiles, and distribution densities are targeted in the probabilistic load forecasting problem. Each model depicts the probability distribution $F(Y_t\mid X_t)$ of the random variable $Y_t$ associated with the target variable $y_t$ at time $t$, given the random variable $X_t$ of the input $x_t$, i.e., $Y_t\sim F(Y_t\mid X_t)$. The random variable $Y_t$ relates to the conditional probability distribution $P_t(\cdot\le y\mid x_t)=F_{y_t}(y\mid x_t)$ of either the target variable, $Y_t\sim P_t(y\le y_t\mid x_t)$, or the forecasting error, $Y_t\sim y_t+P_t(\varepsilon\le\varepsilon_t\mid x_t)$, given the input $x_t$.

Moreover, the conditional probability distribution $P_t(\cdot\le y\mid x_t)$ can be represented by the probability distribution density $P_t$ in (8), the quantile point $y_{\alpha_q,t}$ in (9), or the random prediction interval $I_t$ in (10) [24].

$$\mathcal{M}_i:=\big\{f\colon x_t\to P_t \,\big|\, P_t(y\mid x_t)=f(x_t;\theta_i)\big\}\tag{8}$$
$$\mathcal{M}_{i,q}:=\big\{f\colon x_{\alpha_q,t}\to y_{\alpha_q,t} \,\big|\, y_{\alpha_q,t}=f(x_{\alpha_q,t};\theta_{i,q})\big\}\tag{9}$$
$$\mathcal{M}_i:=\big\{f\colon x_t\to I_t \,\big|\, I_t(y\mid x_t)=f(x_t;\theta_i)\big\}\tag{10}$$

where $\alpha_q$ is the nominal quantile level, and $q=1,2,\dots,Q$ is the order of the quantile prediction; $\theta_{i,q}$ is the parameter vector or matrix of model $i$ for quantile $q$; $I_t(y\mid x_t)=[\,y_{\underline{\alpha}_q,t},\,y_{\overline{\alpha}_q,t}\,]$ is the prediction interval composed of lateral quantiles; and $\overline{\alpha}_q$ and $\underline{\alpha}_q$ are the upper and lower bounds of the nominal quantile level, respectively.

Figure 1 shows the load forecasts at a certain moment in different forms. The quantile prediction can either discretely approximate the probability distribution density $P_t$ or be selected to build the prediction interval $I_t$.

Fig. 1  Load forecasts at a specific moment in different forms. (a) Point, prediction interval, and probability density. (b) Point, quantile, and probability density.

III. Short-term Probabilistic Residential Load Forecasting Scheme

A. Overall Framework

The proposed scheme consists of the attention-based encoder-decoder network, the adversarial domain adaptation network, LSTM RNNs, and the quantile loss. Figure 2 depicts the overall framework of the proposed scheme. First, this framework takes the data of the new residential load, the regular loads, and the known factors as samples. The data of the new residential load, i.e., $\{o_{t-t_w:t}^{\mathrm{new}},o_{t-t_w:t+t_g+t_h}^{\mathrm{known}}\}$, are divided into a testing set $\{x_t^{\mathrm{new}},y_t^{\mathrm{new}}\}$ from the starting time $t_0$ to the current time $t_c$ and a training set $\{x_t^{\mathrm{est}},y_t^{\mathrm{est}}\}$ from the earliest time $t_0^{\mathrm{reg}}=\max\{t_n\}_{n=0:N_{\mathrm{reg}}}$ to the current time $t_c$, where $N_{\mathrm{reg}}$ is the number of historical periods of regular loads.

Fig. 2  Overall framework of proposed scheme.

The training set is also mixed with the data records of regular loads and their known factors, i.e., $\{o_{t-t_w:t}^{\mathrm{reg}},o_{t-t_w:t+t_g+t_h}^{\mathrm{known}}\}$.

Generally, sufficient samples are required to identify flexible NNs with numerous parameters. The Seq2Seq adversarial domain adaptation network is proposed based on a feature extractor, a demand predictor, and a domain classifier by realizing a joint training process with the gradient reversal layer. For implementation, we adopt the generic elements as the submodules of the adversarial domain adaptation network including: ① the feature extractor by connecting an LSTM layer and a dropout layer, ② the demand predictor by connecting another LSTM layer and dropout layer, ③ a fully-connected feedforward layer as the domain classifier, and ④ dense layers after the demand predictor and domain classifier. In addition, we use an attention-based layer and a skip connection to capture the longer temporal dependency while mitigating the vanishing gradient problem, enhancing feature reuse, and facilitating the learning of identity mappings, which is motivated by the Seq2Seq RNN.
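A minimal Keras sketch of this composition is given below; the layer sizes, dropout rate, and quantile head are assumptions for illustration rather than the exact configuration, and the attention layer, skip connection, and gradient reversal layer of Section III-B are omitted (the latter would sit between the feature extractor and the domain classifier).

```python
import tensorflow as tf
from tensorflow.keras import layers

T_IN, N_FEAT, T_OUT, N_Q = 24, 8, 24, 3      # input window, features, horizon, quantiles

inp = layers.Input(shape=(T_IN, N_FEAT))

# 1) Feature extractor: an LSTM layer connected to a dropout layer.
feat = layers.LSTM(40, return_sequences=True)(inp)
feat = layers.Dropout(0.2)(feat)

# 2) Demand predictor: another LSTM layer and dropout layer, plus a dense head.
dec = layers.LSTM(40)(feat)
dec = layers.Dropout(0.2)(dec)
demand = layers.Dense(T_OUT * N_Q)(dec)       # quantile forecasts for every step
demand = layers.Reshape((T_OUT, N_Q), name="demand")(demand)

# 3) Domain classifier: a fully-connected feedforward layer on the shared features.
pooled = layers.GlobalAveragePooling1D()(feat)
domain = layers.Dense(1, activation="sigmoid", name="domain")(pooled)

model = tf.keras.Model(inp, [demand, domain])
model.summary()
```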

Both the data of the new residential load and regular loads can be leveraged via the Seq2Seq adversarial domain adaptation network, where the Seq2Seq RNN is well-trained by quantile loss before generating the multi-step probabilistic load forecasts. The optimal parameter assignments of the Seq2Seq adversarial domain adaptation network are represented by (11) and (12), and then the estimated network generates the final load forecasts by (13).

$$\{\theta^{*,f},\theta^{*,y},\theta^{*,z}\}=\arg\min_{\theta^f,\theta^y,\theta^z}R_{\mathrm{emp}}\big(\theta^{(\cdot)}\,\big|\,D^{\mathrm{est}}\big)\tag{11}$$
$$\theta^{*,\mathrm{dense}}=\arg\min_{\theta^{\mathrm{dense}}}R_{\mathrm{emp}}\big(\theta^{\mathrm{dense}}\,\big|\,D^{\mathrm{new}},\theta^{*,f},\theta^{*,y},\theta^{*,z}\big)\tag{12}$$
$$\hat{o}_t^{\mathrm{new}}=f^{\mathrm{proposed}}\big(o_{t-t_w:t}^{\mathrm{new}},o_{t-t_w:t+t_g+t_h}^{\mathrm{known}};\theta^{*,\mathrm{proposed}}\big)=G^{\mathrm{dense}}\big(G^{y}(G^{f}(x_t^{\mathrm{new}};\theta^{*,f});\theta^{*,y});\theta^{*,\mathrm{dense}}\big)\tag{13}$$

where $D^{\mathrm{new}}$ is the dataset of the new load; $\theta^{*,\mathrm{proposed}}$ is the vector of parameters of the proposed scheme; $\hat{o}_t^{\mathrm{new}}$ is the prediction of the new load; $f^{\mathrm{proposed}}(\cdot)$ is the function of the proposed Seq2Seq adversarial domain adaptation network; $G^{f}(\cdot)$ and $\theta^{f}$ are the feature extractor and the vector of its parameters, respectively; $G^{y}(\cdot)$ and $\theta^{y}$ are the demand predictor and the vector of its parameters, respectively; $G^{\mathrm{dense}}(\cdot)$ and $\theta^{\mathrm{dense}}$ are the output layer and the vector of its parameters, respectively; $\theta^{z}$ is the vector of parameters of the domain classifier; and $\theta^{*,f}$, $\theta^{*,y}$, $\theta^{*,z}$, and $\theta^{*,\mathrm{dense}}$ are the optimal parameter assignments of $\theta^{f}$, $\theta^{y}$, $\theta^{z}$, and $\theta^{\mathrm{dense}}$, respectively.

B. Adversarial Domain Adaptation Network

The parameters of NN-based forecasting models are optimized under the assumption that the training data are independent and identically distributed. However, the accuracy of load forecasts cannot be assured when the data distributions of the training and testing datasets differ, which is known as a shift between the two distributions. Domain adaptation aims to learn a discriminative classifier or another predictor in the presence of such a shift, and is generally operated by matching the feature distributions of the source and target domains of synthetic or semi-synthetic image data. A dominant approach is to accomplish a feature space transformation that measures the similarity or dissimilarity between different distributions and maps the distributions of the source domain to the target domain [25]. In addition, domain adaptation in NNs has been proposed to learn features that are both discriminative for the learning task in the source domain and invariant to the shift between the domains [26]. The combination of domain adaptation and neural architectures can be achieved by jointly optimizing the two components (i.e., the demand predictor and the domain classifier) and their underlying features during the training of feedforward neural architectures.

On the one hand, the demand predictor outputs the load forecasts during both the training and testing processes. On the other hand, during the training process, the domain classifier discriminates between the source and target domains. The feature extractor, which is connected to the demand predictor and the domain classifier, learns deep features that are both discriminative and domain-invariant. Specifically, the parameters of the two components are optimized to minimize their errors on the training set, and the parameters of the feature extractor are optimized to minimize the loss of the demand predictor while maximizing the loss of the domain classifier. After the optimization, the feature extractor learned for the source domain can be implemented in the target domain. The error of the joint training process of the feature extractor, demand predictor, and domain classifier in the adversarial domain adaptation network is calculated by:

$$R_{\mathrm{emp}}(\theta^f,\theta^y,\theta^z)=\sum_{t:z_t=0}\ell_t^{y}(\theta^f,\theta^y)-\lambda\sum_{t=1}^{T}\ell_t^{z}(\theta^f,\theta^z)=\sum_{t:z_t=0}\ell^{y}\big(G^{y}(G^{f}(X_t;\theta^f);\theta^y),y_t\big)-\lambda\sum_{t=1}^{T}\ell^{z}\big(G^{z}(G^{f}(X_t;\theta^f);\theta^z),z_t\big)\tag{14}$$

where $G^{z}(\cdot)$ is the domain classifier; $T$ is the number of samples; $z_t$ is the domain label of the observation, with $z_t=0$ when the data at time $t$ belong to the target domain and $z_t=1$ when they belong to the source domain; $\ell_t^{y}(\cdot)$ and $\ell_t^{z}(\cdot)$ are the loss functions; and $\lambda$ is the trade-off weight between the losses of the domain classifier $G^{z}$ and the demand predictor $G^{y}$.

As a result, the deep feature from the feature extractor $G^{f}$ realizes a space transformation that aligns the output distribution of the demand predictor $G^{y}$ in the source domain $X_t^{\mathrm{sour}}\sim\mathscr{D}^{\mathrm{sour}}$ with that in the target domain $X_t^{\mathrm{tar}}\sim\mathscr{D}^{\mathrm{tar}}$, i.e.,

$$G^{y}\big(G^{f}(X_t^{\mathrm{sour}};\theta^f);\theta^y\big)\approx G^{y}\big(G^{f}(X_t^{\mathrm{tar}};\theta^f);\theta^y\big)\sim\mathscr{D}^{y}\qquad \mathscr{D}^{\mathrm{sour}}\neq\mathscr{D}^{\mathrm{tar}},\ X_t^{\mathrm{sour}}\sim\mathscr{D}^{\mathrm{sour}},\ X_t^{\mathrm{tar}}\sim\mathscr{D}^{\mathrm{tar}}\tag{15}$$

where $\mathscr{D}^{y}$ is the true distribution of the outputs $y_t$, $t=1,2,\dots,T$.

Moreover, the parameters $\hat{\theta}^{f}$, $\hat{\theta}^{y}$, and $\hat{\theta}^{z}$ are alternately optimized to deliver the saddle point of (14):

$$(\hat{\theta}^{f},\hat{\theta}^{y})=\arg\min_{\theta^f,\theta^y}R_{\mathrm{emp}}(\theta^f,\theta^y,\hat{\theta}^{z})\qquad \hat{\theta}^{z}=\arg\max_{\theta^z}R_{\mathrm{emp}}(\hat{\theta}^{f},\hat{\theta}^{y},\theta^{z})\tag{16}$$

Specifically, the saddle point (16) is a stationary point of the stochastic gradient descent (SGD) updates for the feedforward network composed of the feature extractor $G^{f}$, the domain classifier $G^{z}$, and the demand predictor $G^{y}$:

$$\theta^{f}\leftarrow\theta^{f}-\mu\left(\frac{\partial\ell_t^{y}}{\partial\theta^{f}}-\lambda\frac{\partial\ell_t^{z}}{\partial\theta^{f}}\right)\tag{17}$$
$$\theta^{y}\leftarrow\theta^{y}-\mu\frac{\partial\ell_t^{y}}{\partial\theta^{y}}\tag{18}$$
$$\theta^{z}\leftarrow\theta^{z}-\mu\frac{\partial\ell_t^{z}}{\partial\theta^{z}}\tag{19}$$

where μ is the learning rate, which can vary over time.

However, we note that the minimization of the objective function (14), i.e., $\min R_{\mathrm{emp}}$, includes a minimization of the prediction loss, $\min\ell_t^{y}$, and a maximization of the classification loss, $-\min\lambda\ell_t^{z}$. Thus, a "pseudo-function" gradient reversal layer $L_\lambda(x)$, parameterized only by the hyperparameter $\lambda$ as proposed in [26], is implemented to transform the updates (17)-(19) into the standard form of SGD. Specifically, the gradient reversal layer acts as an identity transformation during forward propagation, as given in (20). During backpropagation, the layer passes the gradient from the subsequent neural layer to the preceding layer multiplied by $-\lambda$, as computed by (21).

$$L_\lambda(x)=x\tag{20}$$
$$\frac{\mathrm{d}L_\lambda}{\mathrm{d}x}=-\lambda I\tag{21}$$

where $L_\lambda$ is the gradient reversal layer; and $I$ is an identity matrix.
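A minimal TensorFlow sketch of the gradient reversal layer in (20) and (21) follows; the wrapper class name and the fixed $\lambda$ are illustrative assumptions.

```python
import tensorflow as tf

def gradient_reversal(lam=1.0):
    """Build a gradient reversal op: identity forward, -lam * gradient backward."""
    @tf.custom_gradient
    def _reverse(x):
        def grad(dy):
            return -lam * dy              # backpropagation: multiply by -lambda, cf. (21)
        return tf.identity(x), grad       # forward propagation: identity, cf. (20)
    return _reverse

class GradientReversalLayer(tf.keras.layers.Layer):
    """Keras wrapper so the reversal can sit between the feature extractor Gf
    and the domain classifier Gz."""
    def __init__(self, lam=1.0, **kwargs):
        super().__init__(**kwargs)
        self.reverse = gradient_reversal(lam)

    def call(self, inputs):
        return self.reverse(inputs)
```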

Based on $L_\lambda$, the modification of the objective function (14) compatible with standard SGD forward propagation and backpropagation is obtained as:

$$\tilde{R}_{\mathrm{emp}}(\theta^f,\theta^y,\theta^z)=\sum_{t:z_t=0}\ell^{y}\big(G^{y}(G^{f}(X_t;\theta^f);\theta^y),y_t\big)+\sum_{t=1}^{T}\ell^{z}\big(G^{z}(L_\lambda(G^{f}(X_t;\theta^f));\theta^z),z_t\big)\tag{22}$$

C. LSTM RNNs

The RNN is a promising approach for solving load forecasting problems because records of energy consumption often exhibit temporal characteristics. RNNs, proposed for processing sequential data (such as speech, multivariate time series, and text), can leverage the temporal interdependency of chronological values through the ideas of parameter sharing and graph unrolling [9], [27]. An RNN combines the input at moment $t$, $x_t^{\mathrm{RNN}}$, and the hidden state at the last moment, $h_{t-1}^{\mathrm{RNN}}$, to produce the current hidden state $h_t^{\mathrm{RNN}}$, as described in (23). The output at moment $t$, $y_t^{\mathrm{RNN}}$, is then computed from the hidden state $h_t^{\mathrm{RNN}}$, as described in (24). In multilayer networks, the recurrent layer composed of RNN cells can be used as a specific type of hidden layer. The hidden state $h_t^{\mathrm{RNN}}$ will differ if the chronological order of the timestamps of the input $x_t^{\mathrm{RNN}}$ changes.

$$h_t^{\mathrm{RNN}}=\tanh\big(W^{\mathrm{RNN}}h_{t-1}^{\mathrm{RNN}}+U^{\mathrm{RNN}}x_t^{\mathrm{RNN}}+b_2^{\mathrm{RNN}}\big)\tag{23}$$
$$y_t^{\mathrm{RNN}}=f^{\mathrm{RNN}}\big(V^{\mathrm{RNN}}h_t^{\mathrm{RNN}}+b_1^{\mathrm{RNN}}\big)\tag{24}$$

where $\tanh(\cdot)$ is the hyperbolic tangent function; $W^{\mathrm{RNN}}$, $U^{\mathrm{RNN}}$, and $V^{\mathrm{RNN}}$ are weight matrices; and $b_1^{\mathrm{RNN}}$ and $b_2^{\mathrm{RNN}}$ are bias vectors.

The LSTM RNN adopts the gating mechanism to alleviate the gradient vanishing problem of conventional RNNs when modeling relatively long-term dependency. It creates paths whose derivatives neither vanish nor explode, so that early temporal dependency (the hidden state) can be exploited through the connection weights. The gradients flow via self-loops, where the weights are conditioned on the given data, as depicted in Fig. 3.

Fig. 3  Diagram of RNN variants. (a) Recurrent graph, unrolled graph, and internal cell structure of RNNs. (b) Internal cell structure of LSTM RNNs.

Specifically, the LSTM RNN takes an outer recurrence as a generic RNN and an internal recurrence within the LSTM cell, as given in (25)-(30). In the outer recurrence, the element-wise intermediate variable $z_t^{\mathrm{LSTM}}$ correlates to the affine transformation of the input variable $x_t^{\mathrm{LSTM}}$ and the hidden vector $h_{t-1}^{\mathrm{LSTM}}$ from the last moment. In the internal recurrence, the internal state vector at time $t$, $s_t^{\mathrm{LSTM}}$, depends on the variable $z_t^{\mathrm{LSTM}}$ gated by the input gate $g_t^{\mathrm{in}}$ and the internal state $s_{t-1}^{\mathrm{LSTM}}$ from the last moment gated by the forget gate $g_t^{\mathrm{forget}}$. The LSTM cell output $y_t^{\mathrm{LSTM}}$, i.e., the hidden vector $h_t^{\mathrm{LSTM}}$, derives from the signal passed through the output gate $g_t^{\mathrm{out}}$. This process is depicted in [28].

$$z_t^{\mathrm{LSTM}}=\tanh\big(W^{\mathrm{LSTM}}h_{t-1}^{\mathrm{LSTM}}+U^{\mathrm{LSTM}}x_t^{\mathrm{LSTM}}+b^{\mathrm{LSTM}}\big)\tag{25}$$
$$s_t^{\mathrm{LSTM}}=z_t^{\mathrm{LSTM}}\odot g_t^{\mathrm{in}}+s_{t-1}^{\mathrm{LSTM}}\odot g_t^{\mathrm{forget}}\tag{26}$$
$$y_t^{\mathrm{LSTM}}=h_t^{\mathrm{LSTM}}=\tanh\big(s_t^{\mathrm{LSTM}}\big)\odot g_t^{\mathrm{out}}\tag{27}$$
$$g_t^{\mathrm{in}}=\sigma\big(W^{\mathrm{in}}h_{t-1}^{\mathrm{LSTM}}+U^{\mathrm{in}}x_t^{\mathrm{LSTM}}+b^{\mathrm{in}}\big)\tag{28}$$
$$g_t^{\mathrm{forget}}=\sigma\big(W^{\mathrm{forget}}h_{t-1}^{\mathrm{LSTM}}+U^{\mathrm{forget}}x_t^{\mathrm{LSTM}}+b^{\mathrm{forget}}\big)\tag{29}$$
$$g_t^{\mathrm{out}}=\sigma\big(W^{\mathrm{out}}h_{t-1}^{\mathrm{LSTM}}+U^{\mathrm{out}}x_t^{\mathrm{LSTM}}+b^{\mathrm{out}}\big)\tag{30}$$

where $\sigma(\cdot)$ is the sigmoid function, another activation function besides $\tanh(\cdot)$; and $W^{(\cdot)}$, $U^{(\cdot)}$, and $b^{(\cdot)}$ are the corresponding parameters in the LSTM cell.
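To make (25)-(30) concrete, the following NumPy sketch runs one LSTM cell step per equation; the parameter dictionary layout and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM cell step implementing (25)-(30); p holds the weight
    matrices W*, U* and bias vectors b* keyed by gate name."""
    z = np.tanh(p["W"] @ h_prev + p["U"] @ x_t + p["b"])                   # (25)
    g_in = sigmoid(p["W_in"] @ h_prev + p["U_in"] @ x_t + p["b_in"])       # (28)
    g_forget = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["b_f"])      # (29)
    g_out = sigmoid(p["W_out"] @ h_prev + p["U_out"] @ x_t + p["b_out"])   # (30)
    s_t = z * g_in + s_prev * g_forget                                     # (26)
    h_t = np.tanh(s_t) * g_out                                             # (27)
    return h_t, s_t

# Toy usage: 4 hidden units, 3 input features, random parameters.
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
p = {k: rng.standard_normal((n_h, n_h)) for k in ("W", "W_in", "W_f", "W_out")}
p |= {k: rng.standard_normal((n_h, n_x)) for k in ("U", "U_in", "U_f", "U_out")}
p |= {k: np.zeros(n_h) for k in ("b", "b_in", "b_f", "b_out")}
h, s = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((5, n_x)):   # feed a short input sequence
    h, s = lstm_step(x, h, s, p)
```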

D. Attention-based Encoder-decoder Network

The Seq2Seq RNN exhibits superior capability in modeling temporal characteristics between the input and output sequences by leveraging the local context around the target. The Seq2Seq RNN was first designed and used in computer vision, speech recognition, and natural language processing [15]. Unlike conventional RNNs, it contains an encoder to process sequential data of the influential factors, a decoder to generate multi-step load forecasts, and possibly an attention vector to mark essential dependencies in the sequence, as discussed in [28].

Figure 4 illustrates the encoder-decoder network with mixed inputs, where the encoder and decoder are two individual LSTM RNNs. The post-known factors, i.e., the observed factors $o_{t-t_w:t}^{\mathrm{obs}}$ and the known factors $o_{t-t_w:t}^{\mathrm{known}}$, are the inputs of the encoder. The encoder receives the input $x_{t-t_w:t}^{\mathrm{enc}}$ and extracts the local temporal context $c^{\mathrm{enc}}$ from its hidden state $h_t^{\mathrm{enc}}$, which serves as the first hidden state of the decoder. The decoder combines the local context $c^{\mathrm{enc}}$, the ground-truth input $x_{t+t_g:t+t_g+t_h}^{\mathrm{dec}}$ of the pre-known factors $o_{t:t+t_g+t_h}^{\mathrm{known}}$, and the self-generated input $\hat{y}_{t+t_g:t+t_g+t_h}^{\mathrm{dec}}$ to obtain the predictions $\hat{y}_{t+t_g:t+t_g+t_h}^{\mathrm{dec}}$ related to the hidden states $h_{t+t_g}^{\mathrm{dec}},h_{t+t_g+t_h/n_h}^{\mathrm{dec}},\dots,h_{t+t_g+t_h}^{\mathrm{dec}}$, which can be evaluated against the targets $y_{t+t_g:t+t_g+t_h}^{\mathrm{dec}}$ over the lead time. The green dotted line in Fig. 4 indicates that the prediction at the last moment is fed as the input of the next moment.

Fig. 4  Encoder-decoder network with mixed inputs.
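A compact Keras sketch of this encoder-decoder wiring with teacher forcing is shown below; the dimensions and the three-quantile output head are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

T_ENC, T_DEC, N_OBS, N_KNOWN, UNITS = 24, 24, 6, 4, 40

# Encoder: consumes the observed + known history and returns its final states.
enc_in = layers.Input(shape=(T_ENC, N_OBS + N_KNOWN))
_, h, c = layers.LSTM(UNITS, return_state=True)(enc_in)

# Decoder: at training time it is teacher-forced, i.e., it receives the
# pre-known future factors concatenated with the *ground-truth* load of the
# previous step instead of its own last prediction.
dec_in = layers.Input(shape=(T_DEC, N_KNOWN + 1))
dec_seq = layers.LSTM(UNITS, return_sequences=True)(dec_in, initial_state=[h, c])
out = layers.TimeDistributed(layers.Dense(3))(dec_seq)   # 5%/50%/95% quantiles

seq2seq = tf.keras.Model([enc_in, dec_in], out)
# At inference, the decoder is run step by step, feeding each 50% quantile
# prediction back in place of the ground-truth load (the green dotted line).
```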

E. Bayesian Target Encoding Based Data Pre-processing

Existing works usually consider calendar-related information as categorical features, such as the hour of the day, the day of the week, the month of the year, and the distinctions between holidays and non-holidays, weekdays and weekends, and varying seasons, which can be used as part of the inputs of the load forecasting model after encoding. One-hot encoding is one of the most common encoding approaches for categorical features. Figure 5 shows a common example of one-hot encoding for monthly calendar-related features. The number of newly supplemented 0-1 features can be massive when the original feature involves many categories, e.g., expanding from 1 to 12 dimensions.

Fig. 5  Example of one-hot encoding approach on monthly calendar-related features.

To address the explosion of inputs caused by one-hot encoding, we adopt Bayesian target encoding to pre-process categorical features. Specifically, the values of the categorical features are replaced with the average observations of the load within the corresponding category, as depicted in Fig. 6, thereby depicting the relationship between continuous and categorical features more explicitly, where $L_1$-$L_{8760}$ are the hourly loads in a year; and $\bar{L}_{\mathrm{Jan.}}$-$\bar{L}_{\mathrm{Dec.}}$ are the average monthly loads from January to December.

Fig. 6  Example of Bayesian target encoding on monthly calendar features.
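The following pandas sketch illustrates the idea on synthetic hourly loads; the smoothing pseudo-count is an assumption that gives the encoding its Bayesian (shrinkage) flavor.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-01", periods=8760, freq="H")
df = pd.DataFrame({"load": rng.gamma(2.0, 1.0, 8760)}, index=idx)
df["month"] = df.index.month

# Plain target encoding: replace each month category with its mean load.
month_mean = df.groupby("month")["load"].mean()

# Bayesian (smoothed) variant: shrink sparse categories toward the global
# mean with a pseudo-count m, a common hedge against small categories.
m, global_mean = 50, df["load"].mean()
counts = df.groupby("month")["load"].count()
smoothed = (counts * month_mean + m * global_mean) / (counts + m)
df["month_enc"] = df["month"].map(smoothed)   # one numeric column replaces 12 one-hot columns
```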

Moreover, we utilize the Z-score normalization method to map the values of the target load and its influential factors, which have different dimensions, into a common range [29]. The normalized data $\tilde{y}_t$ calculated by (31) follow the standard normal distribution, whose mean is 0 and standard deviation is 1.

$$\tilde{y}_t=\frac{y_t-\dfrac{1}{T}\sum\limits_{t=1}^{T}y_t}{\sqrt{\dfrac{1}{T}\sum\limits_{t=1}^{T}\left(y_t-\dfrac{1}{T}\sum\limits_{t=1}^{T}y_t\right)^{2}}}\tag{31}$$
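As an implementation note, (31) is a plain standardization step; a short sketch with scikit-learn (an implementation assumption, not the authors' code) is:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.random.default_rng(0).gamma(2.0, 1.0, (8760, 1))
scaler = StandardScaler().fit(y)             # learns the mean and std in (31)
y_tilde = scaler.transform(y)                # zero mean, unit standard deviation
y_back = scaler.inverse_transform(y_tilde)   # undo the scaling after forecasting
```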

F. Miscellanies

To evaluate the model performance, we compare load forecasts and true values via the quantile score (QS) and Winkler score (WS) indexes.

The QS index calculated by (32) and (33) represents the mean of the pinball losses over the lead time and all quantiles. A lower QS indicates forecasts closer to the ground truth values.

$$\ell_{\alpha_q,t}^{\mathrm{qua}}\big(G^{y}(G^{f}(X_t;\theta_{\alpha_q}^{f});\theta_{\alpha_q}^{y}),y_t\,\big|\,\lambda^*\big)\triangleq\ell_{\alpha_q,t}^{\mathrm{qua}}(\hat{y}_{\alpha_q,t},y_t\mid\lambda^*)=\begin{cases}\alpha_q\big(\hat{y}_{\alpha_q,t}-y_t\big) & \hat{y}_{\alpha_q,t}\ge y_t\\(\alpha_q-1)\big(\hat{y}_{\alpha_q,t}-y_t\big) & \hat{y}_{\alpha_q,t}<y_t\end{cases}\tag{32}$$
$$R^{\mathrm{qs}}(\theta^f,\theta^y)=\frac{1}{QT}\sum_{q=1}^{Q}\sum_{t=1}^{T}\ell_{\alpha_q,t}^{\mathrm{qua}}\big(G_{\theta_{\alpha_q}^{y}}^{y}(G_{\theta_{\alpha_q}^{f}}^{f}(X_t)),y_t\,\big|\,\lambda^*\big)\tag{33}$$

where $\ell_{\alpha_q,t}^{\mathrm{qua}}$ is the quantile loss function for the nominal quantile level $\alpha_q$ at time $t$; $\lambda^*$ is the optimal hyperparameter; $\hat{y}_{\alpha_q,t}$ is the prediction for quantile $\alpha_q$ at time $t$; and $R^{\mathrm{qs}}$ is the average quantile loss.
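A NumPy sketch of (32) and (33), following the orientation of the piecewise loss exactly as written above, is:

```python
import numpy as np

def quantile_loss(y_hat, y, alpha):
    """Pinball loss in the orientation of (32)."""
    diff = y_hat - y
    return np.where(diff >= 0.0, alpha * diff, (alpha - 1.0) * diff)

def quantile_score(y_hat, y, alphas):
    """QS index (33): average pinball loss over all T steps and Q quantiles.
    y_hat has shape (Q, T); y has shape (T,)."""
    losses = [quantile_loss(y_hat[q], y, a) for q, a in enumerate(alphas)]
    return float(np.mean(losses))

# Toy usage with the 5%, 50%, and 95% quantiles used in the case study.
y = np.array([1.0, 1.2, 0.9])
y_hat = np.array([[0.8, 0.9, 0.7], [1.0, 1.1, 1.0], [1.3, 1.5, 1.2]])
print(quantile_score(y_hat, y, [0.05, 0.50, 0.95]))
```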

Based on the quantile loss, the loss function of the dense layer after the feature extractor, the gradient update (with the impact of the loss $\partial\ell_{\alpha_q,t}/\partial\theta^{\mathrm{dense}}$), and the optimization process are established by (34)-(36), respectively.

$$\ell_{\alpha_q,t}^{\mathrm{dense}}(\theta^{\mathrm{dense}})=\ell_{\alpha_q,t}^{y}\big(\theta^{\mathrm{dense}};\hat{\theta}^{*,f},\hat{\theta}^{*,y},\hat{y}_{\alpha_q,t}^{\mathrm{new}},y_t^{\mathrm{new}}\big)\tag{34}$$
$$\theta^{\mathrm{dense}}\leftarrow\theta^{\mathrm{dense}}-\mu\frac{\partial\ell_{\alpha_q,t}^{\mathrm{dense}}}{\partial\theta^{\mathrm{dense}}}\tag{35}$$
$$\hat{\theta}^{\mathrm{dense}}=\arg\min_{\theta^{\mathrm{dense}}}\ell_{\alpha_q,t}^{\mathrm{dense}}\big(\hat{\theta}^{*,f},\hat{\theta}^{*,y},\theta^{\mathrm{dense}}\big)\tag{36}$$

where $\ell_{\alpha_q,t}^{\mathrm{dense}}$ is the dense loss function for the nominal quantile level $\alpha_q$ at time $t$.

The gradient updates follow (16). Besides, the loss function (22), with the loss $\ell_{\alpha_q,t}^{\mathrm{proposed}}(\theta^f,\theta^y,\theta^z)$, can be updated as:

$$\ell_{\alpha_q,t}^{\mathrm{proposed}}(\theta^f,\theta^y,\theta^z)=\ell_{\alpha_q,t}^{y}\big(\theta^f,\theta^y;\hat{y}_{\alpha_q,t}^{\mathrm{est,load}},y_t^{\mathrm{est,load}}\big)-\lambda\,\ell_{\alpha_q,t}^{z}\big(\theta^f,\theta^z;\hat{z}_t^{\mathrm{est,label}},z_t^{\mathrm{est,label}}\big)\tag{37}$$
$$(\hat{\theta}^{f},\hat{\theta}^{y})=\arg\min_{\theta^f,\theta^y}\ell_{\alpha_q,t}^{y}(\theta^f,\theta^y,\hat{\theta}^{z})\tag{38}$$
$$\hat{\theta}^{z}=\arg\min_{\theta^z}\ell_{\alpha_q,t}^{z}(\hat{\theta}^{f},\hat{\theta}^{y},\theta^{z})\tag{39}$$

where $\ell_{\alpha_q,t}^{\mathrm{proposed}}$ is the loss function of the proposed framework for the nominal quantile level $\alpha_q$ at time $t$; $\hat{y}_{\alpha_q,t}^{\mathrm{est,load}}$ is the prediction of the new load for the nominal quantile level $\alpha_q$ at time $t$; $y_t^{\mathrm{est,load}}$ is the ground truth of the load at time $t$; $\hat{z}_t^{\mathrm{est,label}}$ is the prediction of the domain label at time $t$; and $z_t^{\mathrm{est,label}}$ is the ground truth label at time $t$.

The WS index evaluates the sharpness and reliability of the prediction intervals constrained to the quantile bounds, as given in (40) and (41). A lower WS is generally desired because an overly wide interval is uninformative [30].

$$\ell_t^{\mathrm{int}}(\hat{y}_{\alpha_q,t},y_t\mid\lambda^*)=\begin{cases}\hat{y}_{\overline{\alpha}_q,t}-\hat{y}_{\underline{\alpha}_q,t} & \hat{y}_{\underline{\alpha}_q,t}\le y_t\le\hat{y}_{\overline{\alpha}_q,t}\\[2pt]\hat{y}_{\overline{\alpha}_q,t}+\left(\dfrac{2}{\beta}-1\right)\hat{y}_{\underline{\alpha}_q,t}-\dfrac{2}{\beta}y_t & y_t<\hat{y}_{\underline{\alpha}_q,t}\\[2pt]\left(1-\dfrac{2}{\beta}\right)\hat{y}_{\overline{\alpha}_q,t}-\hat{y}_{\underline{\alpha}_q,t}+\dfrac{2}{\beta}y_t & y_t>\hat{y}_{\overline{\alpha}_q,t}\end{cases}\tag{40}$$
$$R^{\mathrm{ws}}(\theta^f,\theta^y)=\frac{1}{T}\sum_{t=1}^{T}\ell_t^{\mathrm{int}}(\hat{y}_{\alpha_q,t},y_t\mid\lambda^*)\tag{41}$$
$$\overline{\alpha}_q-\underline{\alpha}_q=\beta,\qquad \underline{\alpha}_q=1-\overline{\alpha}_q=\frac{1-\beta}{2}\tag{42}$$

where $\ell_t^{\mathrm{int}}$ is the interval loss function at time $t$; $R^{\mathrm{ws}}$ is the average WS loss; and $\beta$ is the confidence level relating $\overline{\alpha}_q$ and $\underline{\alpha}_q$.
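A NumPy sketch of the interval score in (40) and (41) follows; the violation penalty is kept as an explicit parameter because conventions for this factor differ across formulations.

```python
import numpy as np

def winkler_score(lo, hi, y, penalty):
    """Interval (Winkler) score, cf. (40)-(41): the interval width plus a
    penalized violation whenever the target falls outside [lo, hi].
    `penalty` is the violation weight (e.g., 2 / (1 - beta) in the classic
    formulation for a beta-confidence interval)."""
    width = hi - lo
    below = np.maximum(lo - y, 0.0)   # violation when y < lower bound
    above = np.maximum(y - hi, 0.0)   # violation when y > upper bound
    return float(np.mean(width + penalty * (below + above)))

# Toy usage for a 90% interval built from the 5% and 95% quantiles.
y = np.array([1.0, 1.2, 0.9])
lo = np.array([0.8, 0.9, 0.7])
hi = np.array([1.3, 1.5, 1.2])
print(winkler_score(lo, hi, y, penalty=2 / (1 - 0.9)))
```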

In addition, the categorical cross-entropy (CCE) index, which is popular for classification problems, is adopted in (43)-(46) to compute the domain classification error. The ratio $\gamma$ between the numbers of samples from the source and target domains is considered. The criterion keeps the gradients computable when training the classification model. A lower CCE index $R^{\mathrm{CCE}}(\theta^f,\theta^z)$ generally represents better classification capability. However, in this study we require a relatively high classification error to achieve distributional consistency of the demand predictor outputs from source domain inputs, $G_{\theta^y}^{y}(G_{\theta^f}^{f}(X_t^{\mathrm{sour}}))$, and target domain inputs, $G_{\theta^y}^{y}(G_{\theta^f}^{f}(X_t^{\mathrm{tar}}))$, across the connection of the feature extractor $G^{f}$, the gradient reversal layer $L_\lambda(x)$, and the domain classifier $G^{z}$, i.e., $G_{\theta^z}^{z}(L_\lambda(G_{\theta^f}^{f}(X_t)))$.

$$\ell_t^{\mathrm{CCE}}\big(\mathrm{Softmax}(G_{\theta^z}^{z}(L_\lambda(G_{\theta^f}^{f}(X_t)))),z_t\,\big|\,\lambda^*\big)\triangleq\ell_t^{\mathrm{CCE}}(\hat{z}_t,z_t\mid\lambda^*)=-\big[\hat{z}_t\ln p_t+\gamma(1-\hat{z}_t)\ln(1-p_t)\big]\tag{43}$$
$$\mathrm{Softmax}(z_t)=\frac{\mathrm{e}^{z_t}}{\sum_{t=1}^{T}\mathrm{e}^{z_t}}\tag{44}$$
$$\gamma=\frac{T^{\mathrm{sour}}}{T^{\mathrm{tar}}}\tag{45}$$
$$R^{\mathrm{CCE}}(\theta^f,\theta^z)=\frac{1}{T}\sum_{t=1}^{T}\ell_t^{\mathrm{CCE}}\big(\mathrm{Softmax}(G_{\theta^z}^{z}(L_\lambda(G_{\theta^f}^{f}(X_t)))),z_t\,\big|\,\lambda^*\big)\tag{46}$$

where $\ell_t^{\mathrm{CCE}}$ is the CCE loss function at time $t$; $R^{\mathrm{CCE}}$ is the average CCE loss; $T^{\mathrm{sour}}$ and $T^{\mathrm{tar}}$ are the numbers of samples from the source and target domains, respectively; $p_t$ is the probability that the sample at time $t$ comes from the target domain; and $\mathrm{Softmax}(\cdot)$ is the normalized exponential function.
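A NumPy sketch of a class-weighted binary cross-entropy in the spirit of (43)-(46) is given below; the label and probability conventions are simplified assumptions.

```python
import numpy as np

def weighted_cce(p, z, gamma):
    """Class-weighted binary cross-entropy in the spirit of (43)-(46):
    p is the predicted probability of the positive class, z the domain
    label, and gamma = T_sour / T_tar rebalances the unequal sample
    counts of the two domains."""
    eps = 1e-12                      # numerical guard against log(0)
    p = np.clip(p, eps, 1.0 - eps)
    loss = -(z * np.log(p) + gamma * (1.0 - z) * np.log(1.0 - p))
    return float(np.mean(loss))     # average CCE, cf. (46)

# Toy usage: 4 source samples (z = 1) and 2 target samples (z = 0), gamma = 2.
z = np.array([1, 1, 1, 1, 0, 0])
p = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.2])
print(weighted_cce(p, z, gamma=4 / 2))
```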

IV. Case Study

A. Dataset Description

A widely accepted real-life dataset published in [31] is utilized in this subsection. Specifically, the dataset contains hourly power consumption data of the entire New England system and eight parallel load zones, including Rhode Island, New Hampshire, Northeast Massachusetts and Boston, West/Central Massachusetts, Southeast Massachusetts, Vermont, Maine, and Connecticut, from January 2022 to March 2023. To implement the proposed scheme, we select two different loads as the source domain (with massive records of historical/regular loads) and the target domain (with limited data of new residential loads), and simulate diverse degrees of data scarcity in the source and target domains via the availability of the corresponding records. For the whole dataset, the data are split into the training set (the samples before those in the validation and testing sets), the validation set (the 672 samples before the testing set), and the testing set (the last 240 samples). When a certain proportion of the whole dataset is available, the same proportion of the training, validation, and testing sets is used for simulation. Before training the model, we shuffle the order of the samples in the training, validation, and testing sets.
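A sketch of this chronological split and the availability simulation is given below; the exact sub-sampling mechanics are an assumption.

```python
import numpy as np

def split_series(samples, n_val=672, n_test=240, proportion=1.0):
    """Chronological split used in the case study: the last 240 samples for
    testing, the 672 before them for validation, the rest for training.
    `proportion` simulates limited data availability by keeping only the
    given fraction of each set (an assumption about the exact mechanics)."""
    train = samples[: len(samples) - n_val - n_test]
    val = samples[len(samples) - n_val - n_test : len(samples) - n_test]
    test = samples[len(samples) - n_test :]
    keep = lambda s: s[: max(1, int(len(s) * proportion))]
    return keep(train), keep(val), keep(test)

samples = np.arange(10000)
train, val, test = split_series(samples, proportion=0.1)  # 10% availability
```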

Generally, the similarity should be evaluated before adapting knowledge from the source domain to the target domain. As a preliminary similarity assessment, we pick the New Hampshire load zone as the target domain and the other load zones as potential source domains. The dominant K-shape clustering algorithm is used to process the eight data sources of energy consumption and calculate the similarity between them [32]. Table I exhibits the assessment results generated from monthly and half-year data, evaluated by the shape-based distance (SBD) index and covering three main clusters, one of which includes the load zones of Maine, Connecticut, Northeast Massachusetts and Boston, New Hampshire, Southeast Massachusetts, and Rhode Island. Thus, we take New Hampshire as the target domain and select the most similar zone, Northeast Massachusetts and Boston, as the source domain for the case studies. We note that the K-shape clustering algorithm can be used in other cases to identify correlated source domains for a target domain.
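A sketch of this preliminary assessment with the tslearn implementation of K-shape and a hand-rolled SBD is given below; the synthetic zone data, cluster count, and target-zone index are assumptions.

```python
import numpy as np
from tslearn.clustering import KShape
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

def znorm(x):
    return (x - x.mean()) / x.std()

def sbd(x, y):
    """Shape-based distance: 1 minus the peak of the cross-correlation of
    z-normalized series, normalized by their norms (the measure K-shape uses)."""
    x, y = znorm(x), znorm(y)
    ncc = np.correlate(x, y, mode="full") / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - float(ncc.max())

# Hypothetical array: 8 load zones x 720 hourly readings (one month).
rng = np.random.default_rng(0)
zones = rng.standard_normal((8, 720)).cumsum(axis=1)

X = TimeSeriesScalerMeanVariance().fit_transform(zones)   # z-normalize each zone
labels = KShape(n_clusters=3, random_state=0).fit_predict(X)

target = 3                                                # hypothetical target-zone row
dists = [sbd(zones[target], zones[i]) for i in range(len(zones))]
# Candidate source domains: zones in the target's cluster with the lowest SBD.
```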

TABLE I  Assessment Results Using K-shape Clustering Algorithm
| Load zone | SBD (monthly) | No. of clusters (monthly) | SBD (half-year) | No. of clusters (half-year) |
| --- | --- | --- | --- | --- |
| West/Central Massachusetts | 0.01693278 | 0 | 0.01404867 | 0 |
| Vermont | 0.01923366 | 1 | 0.02019146 | 1 |
| Rhode Island | 0.00092255 | 2 | 0.00131910 | 2 |
| New Hampshire | 0 | 2 | 0 | 2 |
| Northeast Massachusetts and Boston | 0.00065996 | 2 | 0.00102957 | 2 |
| Southeast Massachusetts | 0.00141926 | 2 | 0.00289324 | 2 |
| Maine | 0.00068811 | 2 | 0.00220722 | 2 |
| Connecticut | 0.00082693 | 2 | 0.00192924 | 2 |

B. Experimental Settings

Before training the customized Seq2Seq RNN in adversarial domain adaptation, we manually determine the assignments of special parameters (i.e., hyperparameters) and tune them within specific ranges and numerical sets. In this study, these experimental settings are divided into two groups: ① the training settings for the Seq2Seq RNN in adversarial domain adaptation in the proposed scheme, the ablative analysis, and the comparative models; and ② the model settings for the customized RNN itself. The experimental settings are summarized in Table II. The proposed scheme in the case study generates the 5%, 50%, and 95% quantile forecasts, which are evaluated by the QS and WS indexes. According to the results of the preliminary test, a target coverage level of 90% is sufficient for evaluating the performance of prediction intervals, and other quantiles such as 10%, 20%, 60%, and 70% are covered by the interval between the 5% and 95% quantile forecasts.

TABLE II  Summary of Experimental Settings
| Classification | Hyperparameter | Value |
| --- | --- | --- |
| Training setting | Length of input sequence | 24×1 |
| | Length of output sequence | 24×1 |
| | Epoch number | 10000 |
| | Repetition number | 10 |
| | Early stopping patience | 50 |
| | Optimization algorithm | AdaM |
| | Number of random searches | 50 |
| Model setting | Learning rate | 0.001, 0.0001, 0.00001 |
| | EL | 1, 2, 3, 4, 5 |
| | EN | 10, 20, 40, 80, 160 |
| | DL | 1, 2, 3, 4, 5 |
| | DN | 10, 20, 40, 80, 160 |
| | Batch size | 64, 128, 256 |
| | Activation function type | tanh(·) |
| | Dropout rate | 0.1, 0.2, 0.3, 0.4, 0.5 |

Note:   EL, EN, DL, and DN are the numbers of encoder layers, encoder neurons per layer, decoder layers, and decoder neurons per layer, respectively.

We identify the appropriate assignments of the model settings (e.g., the learning rate, the batch size, and the dropout rate [33]) by repeating the random searching procedure [34] 80 times among the 3×3×5 groups of possible assignments. The numbers in bold in Table II are the identified results after 80 searches. Besides, we explore the assignments of EL, EN, DL, and DN through a sensitivity analysis to verify the stability and feasibility of the proposed scheme.

With the empirically optimal experimental settings, we repeat the training process (i.e., optimizing model parameters) and the testing process (obtaining load forecasts and comparing them with ground truth values) ten times to ensure the reproducibility and reliability of the results.

The case study is realized via Python (3.9.7), TensorFlow (2.7) [35], scikit-learn (1.2.2) [36], MAPIE (0.6.4) [37], the CUDA Toolkit (11.2.0), and the cuDNN library (8.1) on a machine with an Intel Core i7-11800H CPU @ 2.30 GHz, 16.0 GB of RAM, and an NVIDIA GeForce RTX 3060 GPU with 8.0 GB of memory.

C. Sensitivity Analysis

To validate the feasibility of the proposed scheme, we conduct a detailed sensitivity analysis in terms of the numbers of layers and neurons per layer in the encoder and decoder. Specifically, EL and DL are set to 1, 2, 3, 4, and 5, and EN and DN range from 10 to 160. We then build varying Seq2Seq RNNs over the combinations of the two hyperparameter values. The classification- and regression-related indexes are used to evaluate the differences in model performance under varying hyperparameter assignments. In this subsection, we assume that 100% of the source and target domains are available.

Table III shows the sensitivity analysis results of the proposed scheme with varying hyperparameter assignments. From the average QS and the standard deviation over ten trials, the optimal numbers of EL, EN, DL, and DN are 2, 40, 2, and 40, respectively, which are used as the default configuration. Specifically, the lowest QS is 0.00266. Increasing EL, EN, DL, and DN brings more flexibility to capture the time-series characteristics, reducing the QS from 0.01071 to 0.00266. Nevertheless, too much flexibility causes overfitting to the limited data, and the QS thus increases again when the number of neural layers exceeds 2 or the number of neurons per layer exceeds 40. In the overfitting regime, the number of layers shows much more influence on the QS (from 0.00266 to 0.01071) than the number of neurons per layer (from 0.00266 to 0.00353), conforming to the consensus that a deeper network is more adaptive than a wider one.

TABLE III  Sensitivity Analysis of Proposed Scheme with Varying Assignments of Hyperparameters
| EL | EN | DL | DN | Average QS | Average WS |
| --- | --- | --- | --- | --- | --- |
| 1 | 40 | 2 | 40 | 0.00278 (0.00001) | 0.02613 (0.0001) |
| 2 | 40 | 2 | 40 | 0.00266 (0.00001) | 0.02461 (0.0001) |
| 3 | 40 | 2 | 40 | 0.00267 (0.00001) | 0.02362 (0.0001) |
| 4 | 40 | 2 | 40 | 0.00312 (0.00002) | 0.02077 (0.0001) |
| 5 | 40 | 2 | 40 | 0.01071 (0.00001) | 0.06536 (0.0001) |
| 2 | 10 | 2 | 40 | 0.00316 (0.00002) | 0.02291 (0.0001) |
| 2 | 20 | 2 | 40 | 0.00267 (0.00001) | 0.03284 (0.0002) |
| 2 | 80 | 2 | 40 | 0.00344 (0.00001) | 0.02490 (0.0001) |
| 2 | 160 | 2 | 40 | 0.00353 (0.00001) | 0.03123 (0.0002) |
| 2 | 40 | 1 | 40 | 0.00316 (0.00002) | 0.02869 (0.0003) |
| 2 | 40 | 3 | 40 | 0.00348 (0.00002) | 0.02236 (0.0001) |
| 2 | 40 | 4 | 40 | 0.00501 (0.00004) | 0.08236 (0.0007) |
| 2 | 40 | 5 | 40 | 0.00597 (0.00005) | 0.07154 (0.0006) |
| 2 | 40 | 2 | 10 | 0.00299 (0.00001) | 0.03160 (0.0002) |
| 2 | 40 | 2 | 20 | 0.00269 (0.00001) | 0.03320 (0.0002) |
| 2 | 40 | 2 | 80 | 0.00313 (0.00002) | 0.02006 (0.0001) |
| 2 | 40 | 2 | 160 | 0.00344 (0.00002) | 0.02474 (0.0001) |

Note:   the values in brackets are standard deviations.

Furthermore, the encoder exhibits more impact on the QS (from 0.00266 to 0.01071 in terms of EL, or from 0.00266 to 0.00353 in terms of EN) than the decoder (from 0.00266 to 0.00597 in terms of DL, or from 0.00266 to 0.00344 in terms of DN). This empirically confirms that the encoder of the feature extractor $G^{f}$ plays an essential role in the proposed scheme. Similarly, it should be considered that the optimal assignments for the lowest QS could cause narrow prediction intervals because the quantile forecasts closely track the ground truth values. Although the default configuration does not exhibit the best average WS (β=90%) and standard deviation over ten trials, we recommend these suboptimal assignments of hyperparameters to balance the QS and WS when implementing the proposed scheme.

D. Scenario Exploration

To illustrate the performance of the proposed scheme in addressing data scarcity, we comprehensively simulate diverse scenarios related to the data availability of both the source and target domains, and compare the resulting performance. In other words, we evaluate the errors between the ground truth values and the prediction intervals as well as the quantiles when various proportions of samples from the two domains (10%, 20%, 40%, 60%, 80%, and 100%) are available, as summarized in Table IV. We use both the regression-related criteria QS and WS (β=90%) and the classification index CCE to evaluate the forecasting errors. It should be noted that EL, EN, DL, and DN are set to 2, 40, 2, and 40, respectively.

TABLE IV  Performance of Proposed Scheme
| Source domain samples (%) | Target domain samples (%) | Average CCE | Average QS | Average WS |
| --- | --- | --- | --- | --- |
| 100 | 100 | 0.2500 (0.008) | 0.00266 (0.00001) | 0.02461 (0.0001) |
| 80 | 100 | 0.2111 (0.006) | 0.00331 (0.00002) | 0.02306 (0.0001) |
| 60 | 100 | 0.1687 (0.005) | 0.00267 (0.00001) | 0.03611 (0.0002) |
| 40 | 100 | 0.1214 (0.004) | 0.00351 (0.00003) | 0.02442 (0.0001) |
| 20 | 100 | 0.0667 (0.002) | 0.00286 (0.00002) | 0.03120 (0.0002) |
| 10 | 100 | 0.0352 (0.001) | 0.00319 (0.00002) | 0.03129 (0.0002) |
| 100 | 80 | 0.2361 (0.007) | 0.00298 (0.00001) | 0.02240 (0.0001) |
| 100 | 60 | 0.2187 (0.006) | 0.00312 (0.00002) | 0.02542 (0.0001) |
| 100 | 40 | 0.1964 (0.006) | 0.00509 (0.00004) | 0.02142 (0.0001) |
| 100 | 20 | 0.1667 (0.005) | 0.00350 (0.00002) | 0.02632 (0.0001) |
| 100 | 10 | 0.1477 (0.005) | 0.00471 (0.00004) | 0.02548 (0.0001) |
| 80 | 80 | 0.2500 (0.001) | 0.00338 (0.00002) | 0.02741 (0.0002) |
| 60 | 60 | 0.2500 (0.001) | 0.00363 (0.00002) | 0.02963 (0.0002) |
| 40 | 40 | 0.2500 (0.001) | 0.00402 (0.00003) | 0.02753 (0.0001) |
| 20 | 20 | 0.2500 (0.001) | 0.00414 (0.00003) | 0.02721 (0.0001) |
| 10 | 10 | 0.2500 (0.001) | 0.00862 (0.00006) | 0.06526 (0.0003) |

Note:   the values in brackets are standard deviations.

From Table IV, the average CCE and its standard deviation over ten trials reduce as the gap between the proportions of samples from the source and target domains becomes larger, and the lowest value is 0.0352 when utilizing 10% of the source-domain samples and 100% of the target-domain samples. Meanwhile, the CCE approaches and stays at 0.2500 when the proportions of the source and target domains become similar. On the other hand, we obtain the lowest average QS and standard deviation when all samples are available, i.e., 100% of the samples from both the source and target domains are utilized. However, the prediction interval can be narrow and tight if the probabilistic forecasts are close to the ground truth values. In summary, the proposed scheme shows the ability to extract deep features that transfer knowledge from the source domain to the target domain, decreasing the QS from 0.00862 (when 10% of the source-domain samples are utilized) to 0.00471 (when 100% of the source-domain samples are utilized), given 10% of the target-domain samples.

Moreover, Table IV confirms the adversarial manner in the proposed scheme. The implicit features from the feature extractor Gf make the records of regular loads an advantage to the demand predictor Gy in generating the energy demand predictions for the new load while confusing the domain classifier Gz in judging which domain a specific sample is from. Specifically, the CCE index increases from 0.0352 to 0.2500, while the average QS decreases from 0.00319 to 0.00266 as the proportion of samples from the source domain grows from 10% to 100%.

E. Comparison of Proposed Scheme and Other Schemes

To prove the superiority of the proposed scheme, we compare the probabilistic forecasts generated by machine learning and deep learning schemes. The machine learning schemes include random forests (RFs) and gradient boosting decision trees (GBDTs). The deep learning schemes include the generic fully connected feedforward NN (gen-FFNN), residual FFNN (res-FFNN), gated recurrent unit (GRU) RNN, LSTM RNN, generic temporal convolutional network (gen-TCN), conditional TCN (con-TCN), and WaveNet. In addition, static and teacher-forced (TF) Seq2Seq RNNs without domain adaptation are applied as ablative models to compare with the proposed scheme. We also utilize the QS and WS indexes to evaluate the forecasting results obtained from these schemes with different proportions of samples from the two domains, as summarized in Table V. The settings of the comparative schemes are determined according to [38]-[42].

TABLE V  Forecasting Results of Proposed Scheme Compared with Other Schemes
| Scheme | QS (100% target samples) | WS (100% target samples) | QS (10% target samples) | WS (10% target samples) |
| --- | --- | --- | --- | --- |
| GBDT | 0.00253 (0.00001) | 0.02946 (0.0003) | 0.00666 (0.00001) | 0.02603 (0.0001) |
| RF | 0.00269 (0.00001) | 0.07370 (0.0007) | 0.00648 (0.00003) | 0.05934 (0.0002) |
| Gen-FFNN | 0.00317 (0.00002) | 0.08682 (0.0008) | 0.00636 (0.00003) | 0.22743 (0.0011) |
| Res-FFNN | 0.00317 (0.00002) | 0.08645 (0.0008) | 0.00623 (0.00003) | 0.21977 (0.0010) |
| LSTM RNN | 0.00266 (0.00001) | 0.04991 (0.0005) | 0.01715 (0.00060) | 0.05719 (0.0003) |
| GRU RNN | 0.00291 (0.00001) | 0.05945 (0.0005) | 0.01419 (0.00050) | 0.06197 (0.0003) |
| Gen-TCN | 0.01144 (0.00010) | 0.03771 (0.0003) | 0.03312 (0.00150) | 0.13722 (0.0006) |
| Con-TCN | 0.00775 (0.00007) | 0.03559 (0.0003) | 0.02412 (0.00100) | 0.10000 (0.0004) |
| WaveNet | 0.00845 (0.00007) | 0.04233 (0.0004) | 0.01082 (0.00050) | 0.17805 (0.0100) |
| TF Seq2Seq RNN | 0.00258 (0.00001) | 0.01209 (0.0001) | 0.09575 (0.00041) | 0.43193 (0.0020) |
| Static Seq2Seq RNN | 0.00323 (0.00002) | 0.01173 (0.0001) | 0.09575 (0.00040) | 0.43193 (0.0020) |
| Proposed | 0.00266 (0.00001) | 0.02461 (0.0002) | 0.00471 (0.00002) | 0.02548 (0.0001) |

Note:   the values in brackets are standard deviations.

From Table V, the TF Seq2Seq RNN accomplishes the lowest average QS of 0.00258 with a standard deviation of 0.00001 over ten trials, as expected, and the static Seq2Seq RNN reaches the lowest WS of 0.01173 when β is 90%, given 100% samples from the target domain. In this setting with sufficient data, the proposed scheme is on par with the most dominant schemes, matching the QS of the LSTM RNN at 0.00266. However, the task becomes much more difficult when the available samples are limited to only 10% of the entire dataset, which simulates potential situations of the new residential load. The limited data degrade all comparative schemes. For example, the QS values of the TF and static Seq2Seq RNNs without adversarial domain adaptation degrade from 0.00258 and 0.00323, respectively, to 0.09575, and their WS values increase from 0.01209 and 0.01173 to 0.43193.

Meanwhile, the proposed scheme exhibits its superiority in leveraging sufficient records of regular loads to supplement the available dataset when training the adaptive Seq2Seq RNN. Therefore, the proposed scheme keeps generating accurate forecasts and accomplishes the best performance in terms of both the QS index (0.00471) and the WS index (0.02548). Given the entire target domain, we validate the performance of the proposed scheme against true profiles by illustrating a group of day-ahead quantile predictions and the ground truth values, as shown in Fig. 7, in which the 50% quantile forecasts fit the ground truth values well. The interval between the 5% and 95% quantile forecasts also covers the ground truth values as expected. On the other hand, Fig. 8 depicts the profiles of the 5%, 50%, and 95% quantile forecasts and the target when only 10% of the target-domain samples are available, further demonstrating the effective coverage of the ground truth values and the fitting capability.

Fig. 7  Day-ahead quantile predictions and ground truth values given 100% samples from target domain.

Fig. 8  Day-ahead quantile forecasts and ground truth values given 10% samples from target domain.

V. Conclusion

The proportion and scale of renewable power generation such as solar power in the distribution system keep increasing, so it is imperative to develop load forecasting technologies that obtain precise net load profiles for planning and dispatching the power system under growing renewable penetration. This paper focuses on the volatile residential load series and addresses the data scarcity problem as a significant branch in the field of probabilistic load forecasting. The proposed scheme includes a Seq2Seq RNN with two LSTM layers serving as the feature extractor and the demand predictor, respectively, and a fully connected feedforward layer as the domain classifier.

To implement the adversarial domain adaptation network, we mix historical records and newly collected residential load observations, train the Seq2Seq adversarial domain adaptation network with samples from source and target domains, and generate accurate forecasts.

In the case study, we investigate the stability and feasibility of the proposed scheme for day-ahead probabilistic forecasting by limiting the scale of available data from the source or target domains. The results show that widely accepted methods may lose their capability and become vulnerable when data resources are inevitably limited or insufficient. Meanwhile, although the Seq2Seq RNN is usually fed with massive data, the proposed scheme maintains robust performance for precise load forecasts as the available scales of the source and target domains are gradually reduced. This finding can inspire further discussions and investigations of new technologies to deal with data scarcity in this area. Future work will consider the attention mechanism when integrating domain adaptation into the Seq2Seq RNN.

References

[1] W. Liao, S. Wang, B. Bak-Jensen et al., "Ultra-short-term interval prediction of wind power based on graph neural network and improved bootstrap technique," Journal of Modern Power Systems and Clean Energy, vol. 11, no. 4, pp. 1100-1114, Jul. 2023.
[2] J. Zhu, H. Dong, W. Zheng et al., "Review and prospect of data-driven techniques for load forecasting in integrated energy systems," Applied Energy, vol. 321, p. 119269, Sept. 2022.
[3] IEA. (2019, Dec.). Renewables 2019. [Online]. Available: https://www.iea.org/reports/renewables-2019/distributed-solar-pv
[4] IEA. (2021, Dec.). Renewables 2021. [Online]. Available: https://www.iea.org/reports/renewables-2021
[5] Q. Cui, J. Zhu, J. Shu et al., "Comprehensive evaluation of electric power prediction models based on D-S evidence theory combined with multiple accuracy indicators," Journal of Modern Power Systems and Clean Energy, vol. 10, no. 3, pp. 597-605, May 2022.
[6] L. Ghelardoni, A. Ghio, and D. Anguita, "Energy load forecasting using empirical mode decomposition and support vector regression," IEEE Transactions on Smart Grid, vol. 4, no. 1, pp. 549-556, Mar. 2013.
[7] H. Shi, M. Xu, and R. Li, "Deep learning for household load forecasting - a novel pooling deep RNN," IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 5271-5280, Sept. 2018.
[8] H. S. Hippert, C. E. Pedreira, and R. C. Souza, "Neural networks for short-term load forecasting: a review and evaluation," IEEE Transactions on Power Systems, vol. 16, no. 1, pp. 44-55, Feb. 2001.
[9] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
[10] T. Elsken, J. H. Metzen, and F. Hutter, Automated Machine Learning. Cham: Springer, 2019, pp. 63-77.
[11] B. Stephen, X. Tang, P. R. Harvey et al., "Incorporating practice theory in sub-profile models for short term aggregated residential load forecasting," IEEE Transactions on Smart Grid, vol. 8, no. 4, pp. 1591-1598, Jul. 2017.
[12] W. Kong, Z. Y. Dong, D. J. Hill et al., "Short-term residential load forecasting based on resident behaviour learning," IEEE Transactions on Power Systems, vol. 33, no. 1, pp. 1087-1088, Jan. 2018.
[13] J. Ponoćko and J. V. Milanović, "Forecasting demand flexibility of aggregated residential load using smart meter data," IEEE Transactions on Power Systems, vol. 33, no. 5, pp. 5446-5455, Sept. 2018.
[14] W. Kong, Z. Y. Dong, Y. Jia et al., "Short-term residential load forecasting based on LSTM recurrent neural network," IEEE Transactions on Smart Grid, vol. 10, no. 1, pp. 841-851, Jan. 2019.
[15] K. Cho, B. van Merrienboer, C. Gulcehre et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, Oct. 2014, pp. 1724-1734.
[16] E. Skomski, J. Y. Lee, W. Kim et al., "Sequence-to-sequence neural networks for short-term electrical load forecasting in commercial office buildings," Energy and Buildings, vol. 226, p. 110350, Nov. 2020.
[17] N. Mughees, S. A. Mohsin, A. Mughees et al., "Deep sequence to sequence Bi-LSTM neural networks for day-ahead peak load forecasting," Expert Systems with Applications, vol. 175, p. 114844, Aug. 2021.
[18] Z. Masood, R. Gantassi, Ardiansyah et al., "A multi-step time-series clustering-based Seq2Seq LSTM learning for a single household electricity load forecasting," Energies, vol. 15, no. 7, p. 2623, Apr. 2022.
[19] M. Shepero, D. van der Meer, J. Munkhammar et al., "Residential probabilistic load forecasting: a method using Gaussian process designed for electric load data," Applied Energy, vol. 218, pp. 159-172, May 2018.
[20] L. Cheng, H. Zang, Y. Xu et al., "Probabilistic residential load forecasting based on micrometeorological data and customer consumption pattern," IEEE Transactions on Power Systems, vol. 36, no. 4, pp. 3762-3775, Jul. 2021.
[21] C. Li, Z. Dong, L. Ding et al., "Interpretable memristive LSTM network design for probabilistic residential load forecasting," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 6, pp. 2297-2310, Jun. 2022.
[22] S. Li, Y. Zhong, and J. Lin, "AWS-DAIE: incremental ensemble short-term electricity load forecasting based on sample domain adaptation," Sustainability, vol. 14, no. 21, p. 14205, Oct. 2022.
[23] M. Huang and J. Yin, "Research on adversarial domain adaptation method and its application in power load forecasting," Mathematics, vol. 10, no. 18, p. 3223, Sept. 2022.
[24] J. Wang, X. Xiong, Z. Li et al., "Wind forecast-based probabilistic early warning method of wind swing discharge for OHTLs," IEEE Transactions on Power Delivery, vol. 31, no. 5, pp. 2169-2178, Oct. 2016.
[25] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell et al., "Unsupervised domain adaptation by domain invariant projection," in Proceedings of 2013 IEEE International Conference on Computer Vision, Sydney, Australia, Dec. 2013, pp. 769-776.
[26] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, Jul. 2015, pp. 1180-1189.
[27] F. He, J. Zhou, Z. Feng et al., "A hybrid short-term load forecasting model based on variational mode decomposition and long short-term memory networks considering relevant factors with Bayesian optimization algorithm," Applied Energy, vol. 237, pp. 103-116, Mar. 2019.
[28] H. Dong, J. Zhu, S. Li et al., "Short-term residential household reactive power forecasting considering active power demand via deep Transformer sequence-to-sequence networks," Applied Energy, vol. 329, p. 120281, Jan. 2023.
[29] Q. Cui, J. Zhu, J. Shu et al., "Comprehensive evaluation of electric power prediction models based on D-S evidence theory combined with multiple accuracy indicators," Journal of Modern Power Systems and Clean Energy, vol. 10, no. 3, pp. 597-605, May 2022.
[30] W. Zhang, H. Quan, O. Gandhi et al., "Improving probabilistic load forecasting using quantile regression NN with skip connections," IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5442-5450, Nov. 2020.
[31] ISO New England Inc. (2022, Dec.). Energy, load, and demand reports. [Online]. Available: https://www.iso-ne.com/
[32] J. Paparrizos and L. Gravano, "k-Shape: efficient and accurate clustering of time series," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Australia, 2015, pp. 1855-1870.
[33] N. Srivastava, G. Hinton, A. Krizhevsky et al., "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, Jan. 2014.
[34] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, no. 2, pp. 281-305, Feb. 2012.
[35] M. Abadi, A. Agarwal, P. Barham et al. (2016, Mar.). TensorFlow: large-scale machine learning on heterogeneous distributed systems. [Online]. Available: https://arxiv.org/abs/1603.04467
[36] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, no. 85, pp. 2825-2830, Nov. 2011.
[37] V. Taquet, V. Blot, T. Morzadec et al. (2022, Jul.). MAPIE: an open-source library for distribution-free uncertainty quantification. [Online]. Available: https://arxiv.org/abs/2207.12274
[38] A. Gasparin, S. Lukovic, and C. Alippi. (2019, Jul.). Deep learning for time series forecasting: the electric load case. [Online]. Available: https://arxiv.org/abs/1907.09207
[39] K. Chen, K. Chen, Q. Wang et al., "Short-term load forecasting with deep residual networks," IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 3943-3952, Jul. 2019.
[40] S. Bai, J. Z. Kolter, and V. Koltun. (2018, Mar.). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. [Online]. Available: https://arxiv.org/abs/1803.01271
[41] A. Borovykh, S. Bohte, and C. W. Oosterlee. (2017, Mar.). Conditional time series forecasting with convolutional neural networks. [Online]. Available: https://arxiv.org/abs/1703.04691
[42] A. van den Oord, S. Dieleman, H. Zen et al. (2016, Sept.). WaveNet: a generative model for raw audio. [Online]. Available: https://arxiv.org/abs/1609.03499