1 Introduction

Electrical energy is a basic resource for the national economy and people's daily life, and power systems play a key role in its generation and transmission. In the past few decades, power systems have expanded enormously in scale and become more complex in structure, so reliability is becoming an important issue. With the development of the smart grid, the reliability of electrical equipment and the level of automation have improved on a large scale. However, power systems still cannot operate without people, and people's unsafe behaviors and errors can have a great impact on power systems [1]. To further improve power system reliability, it is necessary to take human factors into consideration.

Analysis of major power system incidents over the past decades shows that human factors contributed significantly to these failures [2]. Human errors have been identified as one of the main causes of the North American blackout in August 2003 [3] and the Italian blackout in September 2003 [4]. Human errors can occur in any situation involving people, such as power system operation, electrical equipment maintenance and power system dispatching [5]. Although the importance of human factors in power systems is gradually being recognized, there is little research in this area.

To better analyze the impact of human factors on power system reliability, we need to understand the mechanism of human error and recognize how human errors occur. Proper analysis methods are necessary, especially for quantitative assessment. Furthermore, the influence of human factors on power systems should be demonstrated from several main aspects. On this basis, measures can be derived to improve human operational reliability.

In this paper, we give a comprehensive introduction to human errors and to some common accidents resulting from human factors. According to specific operation scenarios, we establish several models of human factors and propose corresponding methods for human reliability analysis (HRA), which are verified with practical power system cases. On this basis, we establish a modified maintenance model considering imperfect maintenance caused by human errors. Furthermore, the influence of human factors on dispatching operation and power system cascading failure is analyzed with the IEEE 24-bus test system. Finally, a novel Dispatcher Training Evaluation Simulation System based on information, decision and action in crew (IDAC) is established, which can take all the influencing factors into account. Once fully developed, it can be used for dynamic assessment of dispatchers in order to find operators' shortcomings and improve power system dispatching reliability.

2 Human errors and human factors in power systems

Human errors can be defined as any human actions, cognitive or physical, that potentially or actually have negative effects on the normal functions of a system [6]. As power systems become more complex, human operators are required to work in various situations and may encounter all kinds of emergencies. If human performance exceeds acceptable limits, the result can be a disaster.

The final report on the August 14, 2003 blackout in the United States and Canada shows that dispatchers' lack of monitoring of the grid state was an important cause of the cascading failure [3]. In the Moscow blackout of May 25, 2005, dispatchers failed to take measures after a large number of trips, which caused the accident to expand [7]. On May 7, 2004, the Golmud power grid split from the main grid because of an incorrect action on a protection device by substation personnel. On April 1, 2005, misoperation by operating personnel caused a power outage at the 220 kV Lingyuan substation.

Many such accidents are attributed to human errors, but the causes of the errors themselves are seldom investigated. Human error is not a cause but a consequence, shaped and provoked by upstream factors [8]. Operators' actions in power systems can be affected by various factors, such as the external environment, the complexity of the operation task, and operators' knowledge and experience. We refer to all these factors that may cause human errors as human factors. In some studies, performance influencing factors (PIFs) and performance shaping factors (PSFs) [9] are used to describe human factors; they are usually classified according to various standards and purposes. Reference [10] proposed a data-informed PIF hierarchy for human reliability analysis, which consists of five categories: organization-related, time-related, person-related, situation-related and machine-related factors. Our investigation shows that human operation in power systems can be affected by many factors, such as task complexity, operation period, experience and physical state. In different situations, the dominant factors with the greatest influence on human reliability may differ, so it is important to determine the exact PIFs according to actual situations.

3 Human factors modeling and HRA methods in power systems

It is widely recognized that human errors cannot be avoided completely, but measures can be taken to reduce the human error probability. Human reliability is the complement of human error. As an essential part of probabilistic safety assessment (PSA), HRA has been widely researched in fields with high reliability requirements, such as nuclear power plants and aerospace [11]. Qualitative and quantitative HRA can be used in system design, operation and optimization to improve human reliability. In power systems, however, there are very few HRA studies.

With the development of HRA, many methodologies have been established to analyze human errors, such as the technique for human error rate prediction (THERP) [12], the cognitive reliability and error analysis method (CREAM) [13], the human error assessment and reduction technique (HEART) [14] and a technique for human event analysis (ATHEANA) [15]. In some references, human reliability is assessed using a Markov model with a constant human error transition rate [16, 17]. To capture personnel's cognitive processes when dealing with system failures, several dynamic HRA methods have been proposed, and IDAC is a typical one [18]. Reference [19] identified requirements for a human reliability model to be integrated into system dynamic probabilistic risk analysis. Reference [20] reviewed existing dynamic HRA simulations and gave a perspective on future work to increase the fidelity of simulated accident scenarios. Lack of appropriate and sufficient performance data has been identified as a key factor affecting HRA quality, especially in estimating human error probability. Therefore, the U.S. Nuclear Regulatory Commission (NRC) has been developing an HRA database (SACADA) to meet this data need [21].

Note that most of these methods originated in other industries and are not tailored to power systems, so HRA methods suited to specific power system situations are needed. The primary cause of human errors differs greatly between operation scenarios, so a proper classification of power system operation scenarios is important for HRA. Based on our investigation, power system operation scenarios are classified into three categories: time-centered, process-centered and emergency-centered scenarios. Three HRA methods, one for each of these scenarios, are then proposed.

3.1 Time-centered HRA (TCHRA)

A time-centered scenario refers to a situation where operators have to work continuously for a long time without interruption, such as system state monitoring and new equipment debugging. Operators become fatigued, and the probability of human error increases accordingly. Statistics show that many accidents are caused by fatigue [22]. Continuous working time (CWT) is therefore the primary factor affecting human reliability in this scenario. Other human factors, such as task complexity, environmental factors and human knowledge and experience, may also influence this process.

The proportional hazard model (PHM) [23] can be used for quantitative analysis of time-centered scenarios. PHM has been widely used in engineering, biology and mechanics. The hazard function in PHM consists of two parts, a baseline function and a link function, and can be expressed as

$$h(t,\varvec{Z}) = h_{0} (t)\psi (\varvec{Z}),\;\;\;{\kern 1pt} t \ge 0$$
(1)

where \(h_0(t)\) is the baseline function, which describes how human reliability changes with CWT, and \(\psi (\varvec{Z})\) is the link function, which describes the influence of the covariates \(\varvec{Z}\) on human reliability. In TCHRA, five main covariates are considered: task complexity \(z_1\), environment factors \(z_2\), human knowledge and experience \(z_3\), human psychology \(z_4\) and physical state \(z_5\). Therefore, \(\varvec{Z}\) and the link function are defined as

$$\varvec{Z} = \left[ {z_{1} ,z_{2} ,z_{3} ,z_{4} ,z_{5} } \right]$$
(2)
$$\psi (\varvec{Z}) = \exp (\gamma \varvec{Z})$$
(3)

We assume that the influence coefficient of each covariate takes the value 0, 1 or 2; a larger coefficient means that the factor has a stronger effect on human reliability and that human errors are more likely. γ is the vector of covariate weights. Because available data are limited, the weights cannot yet be obtained by fitting. In this paper, they are obtained via the analytic hierarchy process (AHP) [24]: through expert assessment, the five covariates are compared in pairs with respect to their relative importance to human error probability, and the weights are then calculated from the comparison matrix.
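The AHP weighting step can be sketched as follows. This is a minimal sketch assuming the common principal-eigenvector variant of AHP with Saaty's consistency ratio; the paper does not state which AHP variant it uses, and the pairwise-comparison matrix `A` is a hypothetical expert judgment, not data from the paper.

```python
# Saaty's random consistency index RI for matrix orders 3..9.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(A, max_iter=1000, tol=1e-12):
    """Principal-eigenvector AHP weights and consistency ratio (CR).

    A is a reciprocal pairwise-comparison matrix (A[i][j] == 1 / A[j][i]).
    The eigenvector is found by power iteration, normalized to sum to 1.
    """
    n = len(A)
    w = [1.0 / n] * n
    for _ in range(max_iter):
        aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(aw)
        new_w = [x / total for x in aw]
        converged = max(abs(a - b) for a, b in zip(new_w, w)) < tol
        w = new_w
        if converged:
            break
    aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam_max = sum(aw[i] / w[i] for i in range(n)) / n  # estimate of lambda_max
    ci = (lam_max - n) / (n - 1)                        # consistency index
    return w, ci / RANDOM_INDEX[n]


# Hypothetical judgments for z1..z5 (task complexity, environment,
# knowledge/experience, psychology, physical state); not from the paper.
A = [
    [1,   3, 1,   2,   2],
    [1/3, 1, 1/3, 1/2, 1/2],
    [1,   3, 1,   2,   2],
    [1/2, 2, 1/2, 1,   1],
    [1/2, 2, 1/2, 1,   1],
]
gamma, cr = ahp_weights(A)
print([round(g, 3) for g in gamma], round(cr, 4))
```

A consistency ratio below 0.1 is the usual acceptance threshold; above it, the expert comparisons would normally be revisited before the weights are used as γ.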

Assuming the operation begins at time t = 0, the human reliability function can be expressed as

$$R_{\text{hp}} (t) = P(T \ge t) = \exp \left[ { - \int_{0}^{t} {h(s,\varvec{Z}){\text{d}}s} } \right]$$
(4)

where \(R_{\text{hp}}(t)\) is the probability that no human error has occurred before time t. According to [25], the Weibull hazard function can be adopted as the baseline function, as shown in (5).

$$h_{0} (t) = \frac{{\beta t^{\beta - 1} }}{{\alpha^{\beta } }}$$
(5)

The parameters can be estimated through careful statistical analysis. According to [25], β = 3, α = 200 hours.

To illustrate the relation between human reliability and continuous working hours, we consider three independent scenes. The influence coefficients and weights obtained through expert assessment for each scene are shown in Table 1.

Table 1 Influence coefficient and weight value in each scene

Substituting (3) and (5) into (4), the human reliability function becomes

$$R_{\text{hp}} (t) = \exp \left[ { - \int_{0}^{t} {\frac{{\beta s^{\beta - 1} }}{{\alpha^{\beta } }}\exp (\gamma \varvec{Z}){\text{d}}s} } \right]$$
(6)

The probability of human error could be expressed as

$$F_{\text{hp}} (t) = 1 - R_{\text{hp}} (t)$$
(7)
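Because the baseline (5) integrates in closed form to \((t/\alpha)^{\beta}\), (6) and (7) reduce to \(F_{\text{hp}}(t) = 1 - \exp[-(t/\alpha)^{\beta}\exp(\gamma \varvec{Z})]\). The sketch below evaluates this with β = 3 and α = 200 hours from [25]; the covariate levels `z` and weights `gamma` are hypothetical stand-ins, since the values of Table 1 are not reproduced here.

```python
import math

ALPHA_H = 200.0  # Weibull scale alpha in hours, from [25]
BETA = 3.0       # Weibull shape beta, from [25]


def human_error_probability(t_hours, z, gamma, alpha=ALPHA_H, beta=BETA):
    """F_hp(t) from Eqs. (3)-(7) with the Weibull baseline of Eq. (5).

    Integrating Eq. (5) from 0 to t gives (t / alpha) ** beta, so the
    integral in Eq. (6) needs no numerical quadrature.
    """
    link = math.exp(sum(g * zi for g, zi in zip(gamma, z)))  # psi(Z), Eq. (3)
    cumulative_hazard = (t_hours / alpha) ** beta * link
    return 1.0 - math.exp(-cumulative_hazard)                 # Eq. (7)


# Hypothetical scene: every covariate at influence level 1, weights summing to 1.
gamma = [0.30, 0.09, 0.30, 0.155, 0.155]
z = [1, 1, 1, 1, 1]
for t in (8, 10, 24, 48):
    print(t, human_error_probability(t, z, gamma))
```

With these assumed inputs, the probability at 10 hours stays below the \(7 \times 10^{-4}\) level mentioned for Fig. 1 and then grows roughly with the cube of CWT.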

With the increase of CWT, human error probability changes as shown in Fig. 1.

Fig. 1 Human error probability in TCHRA

In Fig. 1, when CWT is less than 10 hours, the human error probability is extremely low (less than \(7 \times 10^{-4}\)). As CWT increases, the human error probability increases accordingly. Although staff normally work less than 10 hours a day, long continuous work still occurs, for example during the annual inspection of a main transformer while the whole substation is de-energized, or during debugging of new equipment before commissioning. Our investigation shows that human errors are more likely to happen in these situations.

Thus, to ensure operational reliability, continuous working time should be kept within reasonable limits. Human reliability can also be improved by other measures, such as improving operators' skill and experience through training, improving their mental and physical state, and making working conditions more suitable.

3.2 Process-centered HRA (PCHRA)

A process-centered scenario refers to a situation where an operation task consists of many steps and operators must follow certain procedures to finish the work, so attention must be paid to the process to avoid human errors. A modified CREAM can be used to analyze this kind of scenario. CREAM [13], proposed by Hollnagel, holds that cognitive functions are associated with several generic failure types, and it specifies a basic probability value for each generic failure type, called the cognitive failure probability (CFP). The nominal values of cognitive function failures are shown in Table 2.

Table 2 Nominal values for cognitive function failures

In CREAM, all human factors are divided into 9 categories, called common performance conditions (CPC). The expected influence of CPCs on human reliability could be generalized as three levels: reduced, not significant and improved, shown in Table 3.

Table 3 Common performance conditions in CREAM

The basic CREAM method divides control modes into four classes: strategic, tactical, opportunistic and scrambled. Each control mode has a corresponding error probability interval [13]. Although CREAM has been widely accepted and used in many fields, some aspects require improvement. Since the CPCs were not introduced specifically for power systems, they should be made concrete according to the regulations and actual conditions of power systems [26]. For example, working conditions can be divided into the sub-CPCs personal security requirement, equipment security requirement and environment requirement.

Each sub-CPC is assessed first, and the score of each CPC is then obtained with AHP. The score of a CPC varies from 0 to 100 according to the concrete conditions, except for Time of day, which varies from 0 to 24. Because HRA in power systems is still at an early stage and related statistical data are very scarce, expert systems such as fuzzy expert systems can help improve the assessments with the limited data available [27]. In this paper, a triangular fuzzy model is used to reduce the subjectivity of judgment [28].
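A triangular fuzzy model maps a crisp CPC score to membership degrees in the three CREAM levels. The breakpoints in `LEVELS` are hypothetical (the paper does not list them), chosen so that the three memberships always sum to 1 on the 0 to 100 scale, with a higher score meaning better conditions; how the memberships are converted into \(\rho_i\) is not specified here.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 outside [a, c], rising linearly to 1 at the peak b."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets on the 0-100 CPC score scale. Time of day (0-24)
# would need its own sets.
LEVELS = {
    "reduced": (0.0, 0.0, 50.0),
    "not significant": (0.0, 50.0, 100.0),
    "improved": (50.0, 100.0, 100.0),
}


def cpc_membership(score):
    """Membership of one CPC score in each expected-effect level."""
    return {name: triangular(score, *abc) for name, abc in LEVELS.items()}


print(cpc_membership(80.0))
```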

The process of quantifying human error in a process-centered scenario is shown in Fig. 2.

Fig. 2 Process of quantifying human error

In PCHRA, the cognitive function and operation scenario are first determined according to the concrete operation task. The basic value of the human error probability (HEP) is then calculated after the generic failure types are analyzed. For example, if the generic failure types identified for one action are action at wrong time (E2), action out of sequence (E4) and missed action (E5), the basic HEP of this action is calculated as

$$P_{\text{E}} = 1 - (1 - P_{{\text{E2}}} )(1 - P_{{\text{E4}}} )(1 - P_{{\text{E5}}} )$$
(8)
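Equation (8) treats the generic failure types as independent, so the action fails if any one of them occurs. A minimal sketch follows; the CFP values passed for E2, E4 and E5 are the nominal values usually tabulated for CREAM and are used here only for illustration, so they should be checked against Table 2.

```python
from math import prod


def combined_failure_probability(probs):
    """Eq. (8): P = 1 - prod(1 - P_k), the chance that at least one
    independent generic failure type occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


# Illustrative CFPs for E2 (wrong time), E4 (out of sequence), E5 (missed action).
p_action = combined_failure_probability([3.0e-3, 3.0e-3, 3.0e-2])
print(round(p_action, 6))
```

If the steps of an operation ticket are likewise treated as independent, the same function could combine step-level HEPs into a process-level value; the paper does not state its aggregation rule, so that use is an assumption.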

After the basic HEP is obtained, the levels of the CPCs are analyzed and the correction coefficient is obtained using (9). The final HEP is then obtained with (10).

$$\beta = \sum \rho_{i}$$
(9)
$$P_{\text{HEP}} = P_{{{\text{HEP}}_{0} }} \times 10^{0.25\beta }$$
(10)

where \(P_{\text{HEP}_0}\) is the total basic HEP of the whole operation task; \(P_{\text{HEP}}\) is the final HEP; β is the HEP correction coefficient; and \(\rho_i\) is the influence coefficient of CPC\(_i\).
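Equations (9) and (10) can be evaluated directly. In this sketch, \(P_{\text{HEP}_0}\) = 0.0753 is the whole-process value from the Xuyue example below, while the nine \(\rho_i\) values are hypothetical; the cap at 1 is our own safeguard and is not part of (10).

```python
def corrected_hep(p_hep0, rho):
    """Eqs. (9)-(10): beta = sum(rho_i), P_HEP = P_HEP0 * 10**(0.25 * beta)."""
    beta = sum(rho)                          # Eq. (9)
    p_hep = p_hep0 * 10.0 ** (0.25 * beta)   # Eq. (10)
    return beta, min(p_hep, 1.0)             # cap at 1: an assumption, not in (10)


# Hypothetical influence coefficients for the nine CPCs.
rho = [1, 0.5, 0, 1, 0.5, 0, 1, 0, 0]
print(corrected_hep(0.0753, rho))
```

A positive β (mostly "reduced" conditions) raises the HEP by a factor of \(10^{0.25\beta}\), and a negative β lowers it.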

We take Xuyue station as an example [29]. Switching the main transformer from cold standby to operation requires ten steps; the operation in Step 2 is shown in Table 4, and the cognitive function and generic failure types of Step 2 are shown in Table 5. Using (8), the basic HEP of Step 2 is 0.011, and the HEP of the whole process is 0.0753.

Table 4 Step 2 of operation ticket to run spare transformer
Table 5 Human error analysis of operation Step 2

Suppose the operation is conducted in three different contexts, with Context 3 representing the worst situation. In Context 3, the organization is inadequate and lacks a sound management system; the working conditions are unpleasant; operational support is insufficient; there are deficiencies in work arrangement; the work is complex and the time load is heavy for the current operators; moreover, the task is conducted at 4 a.m. Context 1 represents the best case of the three, and Context 2 lies between Context 1 and Context 3. According to the basic CREAM method, Contexts 1, 2 and 3 belong to the tactical, opportunistic and scrambled control modes, respectively.

Through scenario analysis, we could obtain the scores of CPCs in different contexts. Then we could calculate the membership of each level with triangular fuzzy model. With (9) and (10), we could obtain correction coefficient and the final value of HEP, shown in Table 6.

Table 6 Scores of CPC in three different contexts

The simulation shows, on the one hand, that the results of the three contexts fall within the reliability intervals of the corresponding control modes, which supports the validity of the proposed method. On the other hand, Context 1 is more suitable for human operation than Contexts 2 and 3, which demonstrates the impact of CPCs on human operation quantitatively. The change in human reliability when a CPC differs can also be calculated, and targeted measures can be taken according to the simulation results. For example, if time of day (4 a.m. in Context 3) is a main influencing factor, the work could be rescheduled to daytime where possible to improve human reliability.

3.3 Emergency-centered HRA (ECHRA)

An emergency-centered scenario refers to a situation where power system failures have occurred and operators need to react in a short time, including diagnosing faults and taking proper measures. In this scenario, human reliability has a significant effect on clearing faults and restoring system reliability. The human cognitive reliability (HCR) method [30] can be used to quantify HEP in emergency-centered scenarios.

According to the mode of response, human behavior is usually divided into three categories, commonly known as the skill-based, rule-based and knowledge-based (SRK) framework [31]. Skill-based behavior consists of highly integrated patterns of behavior: the operator is so familiar with the situation that the behavior takes place without conscious attention. Rule-based behavior refers to executing routine tasks strictly according to regulations and is typically controlled by a stored procedure. In unfamiliar situations, no procedures are available and human behavior is considered knowledge-based: the operator has to rely on their own knowledge to make decisions and deal with the operation task.

Human errors are therefore more likely to occur in knowledge-based situations and less likely in skill-based situations. According to survey results [30], once the operation task, scenario and operators are determined, the human error probability depends only on the ratio of the operation allowable time \(t\) to the operation execution time \(T_{1/2}\). This relationship can be expressed with a three-parameter Weibull distribution function, shown in (11).

$$P(t) = \exp \left[ { - \left( {\frac{{t/T_{1/2} - \gamma }}{\alpha }} \right)^{\beta } } \right]$$
(11)

where P(t) is the probability of human error; α, β and γ are the scale, shape and location parameters, whose values are determined by the operation category [32] and shown in Table 7; t is the operation allowable time, which is determined by power system characteristics; and \(T_{1/2}\) is the operation execution time, which can be obtained by (12).

Table 7 Values of α, β, γ in different operation categories
$$T_{1/2} = T_{1/2,n} (1 + K_{1} )(1 + K_{2} )(1 + K_{3} )$$
(12)

where \(T_{1/2,n}\) is the average execution time in regular situations, which can be obtained from statistics, and \(K_1\), \(K_2\) and \(K_3\) are the adjustment coefficients of execution time for training, operator mental state and operation support, respectively.
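Equations (11) and (12) together give the non-response probability. In this sketch, the (α, β, γ) triples in `HCR_PARAMS` are coefficients often quoted for the original HCR correlation and are placeholders for Table 7, which is not reproduced here; the K adjustments in the usage lines are hypothetical. When \(t/T_{1/2}\) does not exceed the location parameter γ, the probability is taken as 1.

```python
import math

# Placeholder (alpha: scale, beta: shape, gamma: location) per SRK category;
# substitute the values of Table 7 for the paper's case.
HCR_PARAMS = {
    "skill": (0.407, 1.2, 0.7),
    "rule": (0.601, 0.9, 0.6),
    "knowledge": (0.791, 0.8, 0.5),
}


def execution_time(t_nominal, k1, k2, k3):
    """Eq. (12): T_1/2 adjusted for training, mental state and operation support."""
    return t_nominal * (1 + k1) * (1 + k2) * (1 + k3)


def hcr_non_response(t_allow, t_half, category):
    """Eq. (11): probability that the operation is not completed within t_allow."""
    alpha, beta, gamma = HCR_PARAMS[category]
    x = t_allow / t_half - gamma
    if x <= 0:
        return 1.0  # below the location parameter no success is credited
    return math.exp(-((x / alpha) ** beta))


# Hypothetical adjustments: an experienced operator (faster) and a novice (slower).
t_allow, t_nominal = 8.0, 5.0  # minutes, from the example in the text
T_a = execution_time(t_nominal, -0.2, -0.1, 0.0)
T_b = execution_time(t_nominal, 0.2, 0.1, 0.0)
print(hcr_non_response(t_allow, T_a, "skill"), hcr_non_response(t_allow, T_b, "rule"))
```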

For example, when a line protection channel fault occurs, the main protection must be taken out of operation manually, and this process should be finished within 8 minutes. Our investigation shows that the average time to finish the job is about 5 minutes, but it varies between operators. Operator A has experience in dealing with such situations, while Operator B is inexperienced and has to follow the regulations to finish the job. The parameters are defined in Table 8.

Table 8 Parameters values for different operators

From Table 8, Operator A might fail to finish the operation with a probability of 0.005, while the probability is 0.341 for Operator B. The analysis shows that in emergency situations, experience, operation time and psychological state strongly affect human operation. Improving operators' professional quality is therefore an essential way to enhance human reliability in emergency conditions.

4 Impact of human factors on maintenance

Electrical equipment maintenance is essential for keeping power systems stable, prolonging the service life of equipment and reducing system power loss. According to statistics of grid accidents, mistakes by maintenance personnel account for a large proportion [33]. In this section, we first establish a periodic maintenance model considering imperfect maintenance caused by human factors, and then demonstrate the impact of human factors on maintenance availability with a simple case.

4.1 Electrical equipment maintenance model considering imperfect maintenance

In most cases, analysts assume that maintenance is perfect, which is unrealistic. The effect of maintenance can be weakened by human factors, and the system may even become worse because of human errors [34]. Several common human errors in maintenance and their external forms are listed below.

1) Latent failures that are not detected during maintenance because of operators' insufficient awareness.
2) Wrong adjustments, incorrect estimations of system states and inappropriate decisions.
3) Replacement with faulty parts and damage introduced during maintenance, which can be attributed to human action errors.

The results of maintenance differ considerably with the level of human reliability, as shown in Fig. 3 [35]. Depending on maintenance quality, the results range from perfect maintenance to maintenance failure. Since we aim to demonstrate the impact of human errors on equipment maintenance, we make two assumptions: ① all factors other than human factors are completely reliable; ② considering human factors, the result of maintenance falls into one of three categories.

Fig. 3 Results of maintenance considering human errors

Category 1: perfect maintenance, denoted as PM, namely the system becomes as good as new after maintenance.

Category 2: as bad as old, denoted as ABAO, namely the system state does not change after maintenance.

Category 3: failure after maintenance, denoted as FAM, namely maintenance failure occurs, and the system needs repair after maintenance.

When human errors occur, the maintenance is considered imperfect. The probability of human error (hep) can be obtained with the PCHRA method proposed in Section 3. The proportion of human errors that cause maintenance failure is defined as ξ. The probabilities of PM, ABAO and FAM can then be expressed as follows.

$$P_{{\text{FAM}}} = hep\cdot\xi$$
(13)
$$P_{{\text{ABAO}}} = hep\cdot(1 - \xi )$$
(14)
$$P_{{\text{PM}}} = 1 - P_{{\text{ABAO}}} - P_{{\text{FAM}}}$$
(15)

It is assumed that the system starts as new, with age t = 0. The maintenance period is ΔT and each maintenance takes time Δt. If the system fails during operation, it is repaired with mean time \(\mu_2\); if it fails after maintenance, the mean repair time is \(\mu_1\). Under normal circumstances, \(\mu_1\) is smaller than \(\mu_2\), since a failure during operation is an emergency and the repair is not prepared in advance. If the system state does not change after maintenance, the system continues operating. If the maintenance is perfect, or the system is repaired after failure, the system is renewed and its age returns to 0. The equipment periodic maintenance model is shown in Fig. 4.

Fig. 4 Equipment periodic maintenance model

Given the above description, the following relations can be derived.

$$P_{\text{mf}} = P_{{\text{FAM}}} \sum\limits_{j = 1}^{\infty } {P_{{\text{ABAO}}}^{j - 1} R(j\Delta T)}$$
(16)
$$P_{\text{af}} = (1 - P_{{\text{ABAO}}} )\sum\limits_{j = 1}^{\infty } {P_{{\text{ABAO}}}^{j - 1} F(j\Delta T)}$$
(17)
$$P_{\text{pm}} = P_{{\text{PM}}} \sum\limits_{j = 1}^{\infty } {P_{{\text{ABAO}}}^{j - 1} R(j\Delta T)}$$
(18)

where R(t) is the reliability function; F(t) is the cumulative distribution function of system failure; and \(P_{\text{pm}}\), \(P_{\text{af}}\) and \(P_{\text{mf}}\) are the probabilities that the system is renewed by perfect maintenance, by repair after an actual failure and by repair after a maintenance failure, respectively. The mean time to renewal (MTTR) is

$$\begin{aligned} MTTR = {} & \sum\limits_{j = 1}^{\infty } P_{\text{ABAO}}^{j - 1} \int_{(j - 1)\Delta T}^{j\Delta T} t\,{\text{d}}F(t) \\ & + \left( P_{\text{FAM}} + P_{\text{PM}} \right)\sum\limits_{j = 1}^{\infty } (j\Delta T)P_{\text{ABAO}}^{j - 1} R(j\Delta T) \end{aligned}$$
(19)

The availability of system could be expressed as (20) when maintenance time is neglected.

$$A(\Delta T) = \frac{MTTR}{{MTTR + \mu_{1} P_{{\text{FAM}}} \sum\limits_{j = 1}^{\infty } {P_{{\text{ABAO}}}^{j - 1} R(j\Delta T)} + \mu_{2} (1 - P_{{\text{ABAO}}} )\sum\limits_{j = 1}^{\infty } {P_{{\text{ABAO}}}^{j - 1} F(j\Delta T)} }}$$
(20)

Through a simple case, we will analyze the impact of human errors on maintenance availability.

4.2 Results of case study

The proposed methodology is illustrated using a system of 3 units [36] and the reliability function of this system is

$$R(t) = 3{\text{e}}^{ - 2t} - 2{\text{e}}^{ - 3t}$$
(21)

In this case, hep increases from 0.05 to 0.9, ξ is 0.7, and the ratio \(\mu_1/\mu_2\) is 0.5. The maintenance availability as a function of the maintenance period is shown in Fig. 5. ΔT* is ΔT normalized by \(10^{4}\) hours; for example, if ΔT* = 0.02, then ΔT = 0.02 × \(10^{4}\) = 200 hours.
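Equations (13) to (20) can be evaluated numerically for the reliability function (21). This sketch truncates the infinite sums once \(P_{\text{ABAO}}^{j-1}\) becomes negligible and approximates the integral in (19) with Simpson's rule; ξ = 0.7 and \(\mu_1/\mu_2\) = 0.5 follow the case above, while the absolute repair time \(\mu_2\) = 0.01 (normalized) and ΔT* = 0.1 are hypothetical. The function also returns \(P_{\text{mf}} + P_{\text{af}} + P_{\text{pm}}\), which should equal 1 and serves as a sanity check.

```python
import math


def R(t):
    """System reliability of Eq. (21); t in the same normalized units as dT."""
    return 3.0 * math.exp(-2.0 * t) - 2.0 * math.exp(-3.0 * t)


def f(t):
    """Failure density f(t) = -dR/dt."""
    return 6.0 * math.exp(-2.0 * t) - 6.0 * math.exp(-3.0 * t)


def integral_t_dF(a, b, n=200):
    """Composite Simpson approximation of the integral of t * f(t) on [a, b]."""
    h = (b - a) / n
    s = a * f(a) + b * f(b)
    for k in range(1, n):
        x = a + k * h
        s += (4.0 if k % 2 else 2.0) * x * f(x)
    return s * h / 3.0


def availability(dT, hep, xi, mu1, mu2, tol=1e-12, max_terms=100_000):
    p_fam = hep * xi                 # Eq. (13)
    p_abao = hep * (1.0 - xi)        # Eq. (14)
    p_pm = 1.0 - p_abao - p_fam      # Eq. (15)
    sum_R = sum_F = sum_int = sum_tR = 0.0
    for j in range(1, max_terms + 1):
        w = p_abao ** (j - 1)        # first j-1 maintenances left the state unchanged
        t = j * dT
        sum_R += w * R(t)
        sum_F += w * (1.0 - R(t))
        sum_int += w * integral_t_dF((j - 1) * dT, t)
        sum_tR += w * t * R(t)
        if w < tol:                  # later weights shrink geometrically
            break
    p_mf = p_fam * sum_R                           # Eq. (16)
    p_af = (1.0 - p_abao) * sum_F                  # Eq. (17)
    p_pm_renew = p_pm * sum_R                      # Eq. (18)
    mttr = sum_int + (p_fam + p_pm) * sum_tR       # Eq. (19)
    avail = mttr / (mttr + mu1 * p_mf + mu2 * p_af)  # Eq. (20)
    return avail, p_mf + p_af + p_pm_renew


for hep in (0.05, 0.5, 0.9):
    avail, total = availability(dT=0.1, hep=hep, xi=0.7, mu1=0.005, mu2=0.01)
    print(hep, round(avail, 4), round(total, 6))
```

Sweeping `dT` over a grid for each hep reproduces the kind of curves plotted in Fig. 5, under the assumed repair times.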

Fig. 5 Maintenance availability with different maintenance periods

When hep increases from 0.05 to 0.9, the maintenance availability decreases if ΔT* is less than 1, which is the mean time to failure (MTTF) of the system. When hep is smaller than 0.5, there is an optimal maintenance period that maximizes the availability. When hep is larger than 0.5, maintenance can no longer improve the system availability because of the negative effect of human errors.

The case study shows that human errors strongly affect maintenance availability, and human factors should be taken into consideration when determining the optimal maintenance period.

5 Impact of human factors on dispatching operation

Reasonable dispatching is key to keeping power systems reliable and secure. However, dispatching operation faces risks from uncertainties such as adverse weather, equipment state and human errors. Common human errors in power system dispatching fall into three main categories.

1) Insufficient situation awareness: dispatchers fail to acquire comprehensive system information in time, or fail to understand the system state correctly.
2) Dispatch decision errors: dispatchers make wrong dispatch decisions because of insufficient experience or time pressure.
3) Dispatch action errors: physical action mistakes during operation, including action of wrong type, action on wrong object and missed action.

Analysis of recent grid accidents shows that human errors have a great influence on power system reliability, especially in emergency situations. The human error probability in an emergency situation can be calculated with the ECHRA method proposed in Section 3. Since the allowable time is short, dispatchers may make mistakes under great pressure. In this section, we analyze the impact of human errors on emergency dispatch and on the development of power system cascading failures.

5.1 Impact of human factors on cascading failures

Cascading failure is one of the main causes of power system blackouts. Under normal conditions, transmission lines operate with a certain initial load. However, a single outage may result in thermal overload of other lines. If the overload cannot be removed within the permitted time, more components are tripped one by one, which increases the probability of cascading outage and blackout. The allowable time to relieve a line's overload is shown in Table 9 [37].

Table 9 Allowable working time at different load levels

In Table 9, \(P_{\text{CON}}\), \(P_{\text{LTE}}\) and \(P_{\text{STE}}\) are the continuous, long-time emergency and short-time emergency ratings. We define the third condition as critical overload, because the overloaded lines should be tripped immediately. In the initial stage of a cascading failure, also called the pre-cascading stage, dispatchers have enough time to take measures and prevent the failure from spreading. If dispatchers fail to restore the power system to the normal state at this stage, it enters the fast-cascading stage, which can result in cascading outages and load disconnection.

In this section, only critical overload is considered. Because of human errors, dispatchers may fail to finish the work in the pre-cascading stage. We evaluate the impact of human errors on dispatch operation in emergency conditions with the procedure shown in Fig. 6: an “N − 1” test of the system is conducted, in which each line is tripped in turn as the initial event; a DC power flow is then computed, and any critically overloaded line is tripped. Each dispatch operation is executed correctly with a probability equal to the human reliability.
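The DC power flow step of this evaluation can be sketched on a toy network. The three-bus ring, reactances and 80 MW ratings below are hypothetical (not RTS 79), and the solver is a plain Gaussian elimination on the reduced susceptance matrix with the slack angle fixed at zero; a real study would use a power-system tool and the full RTS 79 data.

```python
def dc_power_flow(n_bus, lines, injection, slack=0):
    """DC power flow: solve B' * theta = P with the slack angle fixed at 0.

    lines: list of (from_bus, to_bus, reactance_pu); injection: net MW per bus
    (generation minus load; the slack entry is ignored). Returns MW flows,
    positive in the from->to direction. Raises ValueError if B' is singular
    (e.g. the outage islands part of the network).
    """
    B = [[0.0] * n_bus for _ in range(n_bus)]
    for i, j, x in lines:
        b = 1.0 / x
        B[i][i] += b
        B[j][j] += b
        B[i][j] -= b
        B[j][i] -= b
    idx = [k for k in range(n_bus) if k != slack]
    m = len(idx)
    M = [[B[r][c] for c in idx] + [float(injection[r])] for r in idx]  # [B' | P]
    for col in range(m):  # Gaussian elimination with partial pivoting
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-9:
            raise ValueError("singular B': network is islanded")
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    theta_red = [0.0] * m
    for r in range(m - 1, -1, -1):  # back substitution
        s = M[r][m] - sum(M[r][c] * theta_red[c] for c in range(r + 1, m))
        theta_red[r] = s / M[r][r]
    theta = [0.0] * n_bus
    for k, v in zip(idx, theta_red):
        theta[k] = v
    return [(theta[i] - theta[j]) / x for i, j, x in lines]


# Hypothetical 3-bus ring: generation at slack bus 0, 100 MW load at bus 2.
LINES = [(0, 1, 0.1), (1, 2, 0.1), (0, 2, 0.1)]
P = [0.0, 0.0, -100.0]
RATING = [80.0, 80.0, 80.0]  # hypothetical critical-overload thresholds, MW

for out in range(len(LINES)):  # "N-1": trip each line in turn
    remaining = [ln for k, ln in enumerate(LINES) if k != out]
    ratings = [r for k, r in enumerate(RATING) if k != out]
    flows = dc_power_flow(3, remaining, P)
    overloaded = [remaining[k] for k, fl in enumerate(flows) if abs(fl) > ratings[k]]
    print(out, [round(fl, 1) for fl in flows], overloaded)
```

In the full procedure, each overloaded line found this way would be tripped unless the dispatch operation succeeds, which happens with the human reliability probability.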

Fig. 6 Impact analysis of human errors on dispatch operation

5.2 Case study of IEEE RTS 79 system

In this section, the IEEE RTS 79 system [38] is used for analysis. We suppose generators #10 and #18 are out of service for maintenance, and line capacities are adjusted. If Line 4 is tripped because of a failure and operators fail to take any measures, the system state develops as shown in Table 10.

From Table 10, Line 28 becomes overloaded after the outage of Line 4. Line 28 is then tripped once the allowable time is exceeded, and as a result Lines 24, 25 and 26 become overloaded. In the next stage, Lines 24-26, 30-33 and 38 are tripped by automatic devices, and Buses 17, 18, 21 and 22 become isolated. As the system moves through further stages, the outage keeps extending. In Stage 5, more than half of the buses are isolated and the system splits.

Table 10 Development of system state

While the failure is spreading, proper measures by operators, such as generation re-dispatching, may avoid system splitting. The amount of load shed depends on the stage at which the system is stabilized.

From Table 11, if the system is stabilized at Stage 1 or 2, there is no load shedding. The loss of load increases to 657 MW and 1401 MW if the emergency dispatching operation succeeds at Stage 3 and Stage 4, respectively. If all operations fail, the system loses the whole load.

Table 11 Loss of load in different stages

The emergency dispatch operation may fail with a certain probability p at every stage. When the system is stabilized after the first successful operation, the expected amount of lost load \(P_{\text{LOSS}}\) can be expressed as

$$\begin{aligned} P_{\text{LOSS}} = {} & (1 - p)P_{\text{LOSS}}(1) + p(1 - p)P_{\text{LOSS}}(2) \\ & + p^{2}(1 - p)P_{\text{LOSS}}(3) + p^{3}(1 - p)P_{\text{LOSS}}(4) \end{aligned}$$
(22)

where $P_{\text{LOSS}}(1)$, $P_{\text{LOSS}}(2)$, $P_{\text{LOSS}}(3)$ and $P_{\text{LOSS}}(4)$ are the loads lost when the system is stabilized at Stages 1 to 4, respectively. Note that only the first four stages are considered, because the system will probably split if it cannot be stabilized within them.
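Eq. (22) is a truncated geometric weighting, $P_{\text{LOSS}} = \sum_{k=1}^{4} p^{k-1}(1-p)P_{\text{LOSS}}(k)$. The system is stabilized at Stage k when the first k − 1 operations fail and the k-th succeeds. The minimal Python sketch below evaluates this sum using the stage losses from Table 11. It assumes the same independent failure probability p at every stage. The function name and the value p = 0.1 are hypothetical and are not taken from Table 12.

```python
def expected_loss_of_load(p, stage_losses):
    """Evaluate Eq. (22): sum over stages k of p**(k-1) * (1 - p) * P_LOSS(k).

    p            -- assumed probability that the emergency dispatch operation fails at a stage
    stage_losses -- [P_LOSS(1), ..., P_LOSS(K)] in MW
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    total = 0.0
    for k, loss in enumerate(stage_losses, start=1):
        # Operations at stages 1..k-1 fail and the operation at stage k succeeds.
        total += p ** (k - 1) * (1 - p) * loss
    return total


# Stage losses from Table 11 (MW); stabilizing at Stage 1 or 2 sheds no load.
TABLE_11_LOSSES = [0.0, 0.0, 657.0, 1401.0]

print(f"{expected_loss_of_load(0.1, TABLE_11_LOSSES):.2f} MW")  # prints 7.17 MW
```

The weights sum to 1 − p^4, so the probability p^4 that all four operations fail is excluded. This matches the choice above to consider only the first four stages.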

We consider three scenarios. In each, the human operation error probability p is calculated with the proposed ECHRA method, and the resulting values of p are shown in Table 12. Line transmission capacity is set to 75% of the rated capacity. Applying the evaluation procedure shown in Fig. 6 yields the following results.

Table 12 Loss of load in different operation scenarios

Table 12 shows that the loss of load is smallest in Scenario 1 and largest in Scenario 3. In Scenario 1, human operation reliability is the highest, so operators can take emergency dispatching actions and stabilize the system early. In Scenario 3, the human error probability is the largest, so failures are more likely to spread because of human factors. As a result, more load has been shed by the time the system is stabilized.

Note that this section aims to analyze the impact of human errors on emergency dispatch and on the development of power system cascading failures. To simplify the discussion, we neglect dependency among operators. Further research will establish a comprehensive analysis model for dispatching operation that considers the whole crew rather than a single operator.

6 Dispatcher training evaluation simulation system

The analysis in Section 5 shows that human factors have a great impact on dispatch operation and power system reliability. Human error theory and human reliability analysis should therefore be applied in practice. In this part, we propose a framework for a dispatcher training evaluation simulation system (DTESS), which can be used as a tool for dispatcher training simulation. Unlike the conventional dispatcher training simulator (DTS) [39], DTESS is based entirely on simulation, and operators’ cognition process is modeled with the IDAC method. Once fully developed, it can simulate dispatchers’ responses to various conditions in detail. The framework of DTESS is shown in Fig. 7.

Fig. 7 Framework of DTESS

As shown in Fig. 7, DTESS consists of four modules: the main program module, operator module, scheduler module and power system simulator module.

  1)

    Operator module: This module models dispatchers’ responses to the system in different situations. Based on IDAC, it consists mainly of three parts: the I-D-A cognition model [40], performance influencing factors (PIFs) [41] and rules of behavior [42]. The operators’ state is initialized at the very beginning. During the evaluation, static PIFs stay unchanged over a period of time, while dynamic PIFs are reassessed at every time step. Operator actions are then generated according to the rules of behavior.

    Dispatchers are required to take various training courses and tests, such as skill training, security testing, psychological tests and qualification grading. All of these results can be incorporated to make the dispatcher model more precise.

  2)

    Scheduler module: DTESS uses a dynamic event tree (DET) [43] to represent how a scenario develops after an initiating event. This module controls the evolution of event sequences, and branches are generated at any time point at which the system state changes or operators take actions. Termination criteria should be specified before the simulation, for example the branch probability falling below a specified value or the power system splitting into disconnected parts. A sequence is terminated once a termination criterion is met.

    This module also saves information at each branch point, such as the power system state, dispatchers’ actions and branch probabilities. Details of the dispatchers’ operation can be obtained by retrieving this information.

  3)

    Dispatcher simulator training module: This module includes most parts of a conventional DTS. It simulates static or dynamic processes of power systems, including the behavior of relays and automatic equipment. Another function of the control center model in this module is to provide interaction between the power system and the dispatcher. On the one hand, it presents the current power system state to the dispatcher model through data acquisition, data processing, event and alarm processing, remote adjustment and control, the man-machine interface, etc. On the other hand, dispatchers’ actions are implemented through this model.

  4)

    Main program module: This module is the controlling part of the framework and manages the calls to the other modules. The general flow of DTESS is shown in Fig. 8. At the beginning of the evaluation, the power system state and the PIF levels are initialized. Based on the dispatcher model and the power system model, the scheduler module decides whether DET branches are generated. If there is more than one branch, the scheduler module saves the branch information and proceeds with simulating the first branch until it meets a termination criterion. The end-state information is then stored, and the next branch is loaded to continue the simulation. The simulation terminates when all sequences have been simulated.

    Fig. 8 Flow chart of simulation process in DTESS
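The branching and termination logic of the scheduler and main program modules can be sketched as a depth-first exploration of a DET. The Python sketch below is a toy model under stated assumptions. The only branching point is whether the dispatcher’s emergency action succeeds (probability 1 − p) or fails (probability p). A failure advances the cascade by one stage, and the system is treated as split at Stage 5, which mirrors the case study in Section 5.2. The `Branch` class, the `explore_det` function and the `min_prob` threshold are hypothetical names, not part of DTESS. No power system simulator or IDAC operator model is included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    stage: int            # cascade stage reached on this sequence (toy state)
    probability: float    # probability of this sequence so far
    history: tuple = ()   # dispatcher outcomes along the sequence


def explore_det(p_error, split_stage=5, min_prob=1e-6):
    """Depth-first exploration of a toy dynamic event tree (DET).

    Assumed termination criteria: the dispatcher's action succeeds (system
    stabilized), the cascade reaches `split_stage` (system split), or the
    branch probability drops below `min_prob` (sequence truncated).
    """
    end_states = []
    pending = [Branch(stage=1, probability=1.0)]  # saved branch information
    while pending:
        branch = pending.pop()  # load the most recently saved branch
        if branch.history and branch.history[-1] == "success":
            end_states.append(("stabilized", branch))
            continue
        if branch.stage >= split_stage:
            end_states.append(("split", branch))
            continue
        if branch.probability < min_prob:
            end_states.append(("truncated", branch))
            continue
        # Branching point: the emergency dispatch action fails or succeeds.
        # The failure branch is saved first, so the success branch is simulated first.
        pending.append(Branch(branch.stage + 1, branch.probability * p_error,
                              branch.history + ("failure",)))
        pending.append(Branch(branch.stage, branch.probability * (1 - p_error),
                              branch.history + ("success",)))
    return end_states


for outcome, branch in explore_det(p_error=0.1):
    print(f"{outcome:<10} stage {branch.stage}  probability {branch.probability:.4g}")
```

A stack (last in, first out) reproduces the “simulate the first branch, store its end state, then load the next saved branch” order described for the main program module. Because every end state keeps its probability, the outcome probabilities sum to one unless branches are truncated.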

Compared with DTS, DTESS has several advantages. First, DTESS can record in detail dispatchers’ actions, both cognitive and physical, in response to an initiating event. Second, because all branch probabilities are stored, dispatchers can be assessed more objectively and accurately. By analyzing the simulation results, operators’ shortcomings can be identified and dispatch reliability improved. In addition, by adjusting the operator module, DTESS can also be used to assess the quality of other training, such as security training and skill training.

7 Conclusion

Human factors have a great impact on power system reliability; however, little research has been done in this field. To address this gap, this paper attempts a comprehensive analysis of the impact of human factors on power system reliability. The main contributions of this paper are as follows.

  1)

    Through analyzing human errors and operation scenarios in power systems, we established human factor models and proposed three human reliability analysis methods. Because these methods are built on the practical characteristics of power system operation scenarios, they are well suited to power systems, and they have been verified with practical power system cases.

  2)

    We analyzed the impact of human factors on maintenance. Because of human errors, electrical equipment maintenance cannot always be perfect, which can significantly affect availability after maintenance. It is therefore necessary to take human factors into consideration when determining maintenance policy.

  3)

    We analyzed the impact of human errors on emergency dispatch. The analysis and evaluation results demonstrate that improving human operation reliability can help avoid cascading failures and reduce the loss of load.

  4)

    Based on IDAC, we proposed a framework for a dispatcher training evaluation simulation system, which can be used as a tool for dispatcher training simulation. It can take all the influencing factors into account and make a comprehensive assessment of dispatchers. With DTESS, operators’ shortcomings can be identified and dispatch reliability improved.

Human reliability analysis in power systems deserves more attention. Further research is needed on how to quantify human error probability, on the influence of human factors on power systems, and on measures to reduce human errors and enhance power system reliability.