Abstract
The power market is a typical imperfectly competitive market in which power suppliers gain higher profits through strategic bidding behaviors. Most existing studies assume that a power supplier has access to sufficient market information to derive an optimal bidding strategy. However, this assumption may not hold in reality, particularly when a power market is newly launched. To help power suppliers bid with limited information, a modified continuous action reinforcement learning automata algorithm is proposed. This algorithm introduces discretization and the Dyna structure into the continuous action reinforcement learning automata algorithm for easy implementation in a repeated game. Simulation results verify the effectiveness of the proposed learning algorithm.
ELECTRICITY market reforms are gradually occurring around the world, particularly in China. An electricity market typically includes power suppliers, independent system operators (ISOs), and power consumers. Power suppliers bid in the market to satisfy the electricity demand of power consumers, while an ISO is responsible for the operation and maintenance of the market. Due to the limited number of power suppliers, the power supply-side market is typically considered an oligopoly market, where the profit of one supplier is affected by both the power system operation condition and the bidding actions of the other suppliers. Thus, all suppliers are incentivized to bid strategically [
The most common approach to developing supplier bidding strategies is to establish a game-theoretical model [
Under such circumstances, the reinforcement learning [
However, existing studies typically formulate the supplier bidding and market clearing process as a Markov (stochastic) game [
Additionally, most studies still assume that suppliers can obtain their rivals’ historical bidding information. This assumption may not hold, particularly in the early stages of a market, where a power supplier only has access to its own historical bidding information. Few studies have discussed the case where suppliers have to bid with little external information. In addition, the efficiency of the learning algorithms has received little attention, which leads to inefficient learning processes.
The contributions of this paper are outlined below:
1) This paper defines the bidding procedure of power suppliers with thermal power units as a repeated game [
2) This paper proposes a modified continuous action reinforcement learning automata (M-CARLA) algorithm to enable power suppliers to bid with limited information in the repeated game. This algorithm combines discretization and the Dyna structure [
The remainder of this paper is organized as follows. Section II presents the market structure and the repeated game. Section III details the proposed M-CARLA algorithm. A case study is performed in Section IV. Section V concludes the paper.
A power market typically includes three major parts: the power suppliers, the power consumers, and the market operator.
The supply function model [
$C_i(g_i) = a_i g_i^2 + b_i g_i$  (1)
where i is the index of the power supplier; Ci is the cost function; ai and bi are the coefficients of the quadratic and linear terms, respectively; and gi is the dispatched power output.
Before each round of market clearing, supplier i submits its cost function to the market operator. The power supply-side market is imperfectly competitive, which motivates power supplier i to bid strategically to gain a higher profit. The strategic factor can be either the slope or the intercept of the supply function; in this paper, it is chosen to be the slope.
Based on this assumption, the submitted cost function will become:
$C_{i,\text{submit}}(g_i) = a_{i,\text{strategic}} g_i^2 + b_i g_i$  (2)
where Ci,submit is the submitted cost function; and ai,strategic is the strategic slope.
After each round of market clearing, supplier i obtains the dispatched power output gi and the locational marginal price (LMP) of the node where it is located.
The objective of supplier i is to maximize its profit qi:
$q_i = \lambda_i g_i - (a_i g_i^2 + b_i g_i)$  (3)

where $\lambda_i$ is the LMP of the node where supplier i is located.
The utility function [
$U_j(l_j) = c_j l_j - d_j l_j^2$  (4)
where j is the index of the power consumer; Uj is the utility function; cj and dj are the coefficients of the linear and quadratic terms, respectively; and lj is the load demand.
Before each round of market clearing, consumer j submits the true utility function to the market operator.
After each round of market clearing, consumer j obtains the dispatched power demand lj and the LMP of the node where it is located.
The market operator gathers the bids of all power suppliers and consumers and then runs the economic dispatch algorithm. The objective function is:
$\max \sum_{j \in J} U_j(l_j) - \sum_{i \in I} C_{i,\text{submit}}(g_i)$  (5)
where I is the set of suppliers; and J is the set of consumers.
The objective is to maximize social welfare. The equality constraint of the optimization problem is the balance of power generation and consumption:
$\sum_{i \in I} g_i = \sum_{j \in J} l_j$  (6)
The inequality constraints include the power flow constraints of transmission lines, the generation limits of suppliers, and the demand limits of consumers:
$-F_y^{\max} \le F_y \le F_y^{\max}, \quad \forall y \in Y$  (7)

$g_{i,\min} \le g_i \le g_{i,\max}, \quad \forall i \in I$  (8)

$l_{j,\min} \le l_j \le l_{j,\max}, \quad \forall j \in J$  (9)
where $F_y$ is the power flow of transmission line y; $F_y^{\max}$ is the upper limit of $F_y$; Y is the set of transmission lines; gi,min and gi,max are the lower and upper limits of the power output of supplier i, respectively; and lj,min and lj,max are the lower and upper limits of the power demand of consumer j, respectively.
The power flow of each transmission line can be calculated based on [
$F = T(G - L)$  (10)
where F is the power flow matrix; T is the power transfer distribution factor (PTDF) matrix; and G and L are the power output and the load consumption matrices, respectively.
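To make the clearing procedure concrete, the following minimal sketch (with assumed coefficient values, two suppliers, and two consumers) maximizes the social welfare in (5) subject to the power balance (6) and the generation and demand limits (8) and (9); the transmission constraints (7) are omitted for brevity because they require the PTDF matrix of a specific network.

```python
# Minimal market-clearing sketch with assumed coefficients; only (5), (6),
# (8), and (9) are modeled here, and (7) is omitted for brevity.
import numpy as np
from scipy.optimize import minimize

a = np.array([0.020, 0.025])   # assumed quadratic cost coefficients a_i
b = np.array([20.0, 22.0])     # assumed linear cost coefficients b_i
c = np.array([40.0, 38.0])     # assumed linear utility coefficients c_j
d = np.array([0.050, 0.040])   # assumed quadratic utility coefficients d_j
g_bounds = [(0.0, 200.0)] * 2  # generation limits (8)
l_bounds = [(0.0, 150.0)] * 2  # demand limits (9)

def neg_social_welfare(x):
    g, l = x[:2], x[2:]
    cost = np.sum(a * g**2 + b * g)      # submitted cost functions (2)
    utility = np.sum(c * l - d * l**2)   # utility functions (4)
    return -(utility - cost)             # minimize the negative of (5)

balance = {"type": "eq", "fun": lambda x: np.sum(x[:2]) - np.sum(x[2:])}  # (6)
res = minimize(neg_social_welfare, np.full(4, 50.0),
               bounds=g_bounds + l_bounds, constraints=[balance], method="SLSQP")
print("dispatched output g:", np.round(res.x[:2], 2))
print("dispatched demand  l:", np.round(res.x[2:], 2))
```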
This paper focuses on optimizing a single-time-period bidding strategy in a real-time market (RTM). The market structure based on the supply function model is shown in Fig. 1.

Fig. 1 Market structure based on supply function model.
The market in a repeated game can be stationary or nonstationary from the perspective of a single strategic supplier. If the other suppliers bid their true marginal cost functions, the environment is stationary; if the other suppliers also bid strategically, the environment is nonstationary.
The continuous action reinforcement learning automata (CARLA) algorithm [
However, this algorithm is difficult to apply directly because of the heavy symbolic and integration operations it requires in a continuous action space. As the iterations proceed, the computational cost grows rapidly, and the calculations may become unsolvable [
The M-CARLA algorithm contains four steps, with Steps 2 through 4 repeated in every round.
Step 1: initialize the probability density function (PDF).
The bidding action (i.e., the strategic slope) and the action PDF at the nth iteration are denoted as a(n) and f(n, a), respectively.
Because suppliers have little prior knowledge about the market, the action PDF is initialized as a uniform distribution:
$f(1, a) = \begin{cases} \dfrac{1}{a_{\max} - a_{\min}} & a_{\min} \le a \le a_{\max} \\ 0 & \text{otherwise} \end{cases}$  (11)
where amax and amin are the upper and lower limits of the slope a, respectively.
Step 2: choose actions.
The action space is divided into x equal subintervals with endpoints {a0, a1, ..., ax}, where each subinterval has length d. The continuous PDF is then replaced by its discrete values at these endpoints. At the nth iteration, an action is selected based on the discretized PDF as follows.
Based on the trapezoidal rule [
$S_m(n) = \dfrac{d}{2}\left[f(n, a_{m-1}) + f(n, a_m)\right]$  (12)

where $S_m(n)$ is the area of the mth subinterval at the nth iteration.
After calculating the areas of all subintervals, the cumulative probability of the action at endpoint m can be calculated by:
$P_m(n) = \sum_{k=1}^{m} S_k(n)$  (13)
Before an action is selected, a random variable z(n) is generated from the uniform distribution over [0, 1]. The subinterval t is determined as the one satisfying $P_{t-1}(n) < z(n) \le P_t(n)$ (with $P_0(n) = 0$), and then a(n) can be written as:
$a(n) = a_{t-1} + d\,\dfrac{z(n) - P_{t-1}(n)}{P_t(n) - P_{t-1}(n)}$  (14)
This process preserves the continuity of the selected action, which is different from the finite action learning automata (FALA) algorithm [
The following example, illustrated in Fig. 2, demonstrates the “choose actions” process.

Fig. 2 “Choose actions” process. (a) Continuous action PDF. (b) Discretized action PDF.
Assume that the action space is divided into 8 subintervals. The continuous action PDF in Fig. 2(a) is replaced by its discrete values at the endpoints, as shown in Fig. 2(b):

(15)
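To make Steps 1 and 2 concrete, a minimal sketch is given below: it initializes a uniform PDF as in (11), computes the subinterval areas and cumulative probabilities as in (12) and (13), and then selects a continuous action by interpolating inside the chosen subinterval, consistent with (14). The slope range and the random seed are illustrative assumptions.

```python
# Sketch of "choose actions" on a discretized PDF (illustrative slope range).
import numpy as np

a_min, a_max, x = 0.02, 0.10, 8              # action (slope) range, 8 subintervals
endpoints = np.linspace(a_min, a_max, x + 1)
d = (a_max - a_min) / x
pdf = np.full(x + 1, 1.0 / (a_max - a_min))  # uniform initial PDF, as in (11)

def choose_action(pdf, rng):
    areas = d / 2.0 * (pdf[:-1] + pdf[1:])           # trapezoidal areas (12)
    cum = np.concatenate(([0.0], np.cumsum(areas)))  # cumulative probabilities (13)
    cum /= cum[-1]                                   # guard against rounding drift
    z = rng.uniform(0.0, 1.0)
    t = int(np.clip(np.searchsorted(cum, z), 1, x))  # subinterval with cum[t-1] < z <= cum[t]
    frac = (z - cum[t - 1]) / (cum[t] - cum[t - 1])
    return endpoints[t - 1] + frac * d               # continuous action, as in (14)

rng = np.random.default_rng(0)
print("selected action:", round(choose_action(pdf, rng), 4))
```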
Step 3: generate reinforcement signals.
After the market is cleared in the current round, the power supplier will calculate the profit q(n) based on (3) to obtain the real experience (a(n), q(n)). Then, the strategic supplier will evaluate the reinforcement signal as:
$\beta(n) = \min\left\{\max\left[\dfrac{q(n) - q_{\text{med}}}{q_{\max} - q_{\text{med}}},\ 0\right],\ 1\right\}$  (16)
where qmax and qmed are the maximum and the median values in data buffer 1, respectively.
Data buffer 1 provides the historical profit data for evaluating the reinforcement signal; the initial value in data buffer 1 is 0. A larger β(n) indicates a stronger reward signal, while a smaller β(n) indicates a stronger punishment signal. The supplier saves q(n) into data buffer 1 after this evaluation. To avoid storage overflow, only the latest L rounds of q(n) are saved.
However, solely relying on interactions with the real world is sometimes inefficient. Inspired by the Dyna structure [
To generate a virtual experience at the nth iteration, a virtual action is first generated as:
(17)
where av(n) is the virtual action at the nth iteration.
The mapping from the virtual action to the corresponding profit is a regression problem. The K-nearest neighbor (KNN) method [
The corresponding virtual profit can be generated as:
(18)
Then, the virtual reinforcement signal βv(n) at the nth iteration can be evaluated from the virtual experience (av(n), qv(n)) in the same way as (16).
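As an illustration of Step 3, the sketch below evaluates the reinforcement signal with the clipped form of (16) and generates a virtual profit with a plain (unweighted) K-nearest-neighbor average over the stored (action, profit) pairs; the unweighted average and all numerical values are illustrative assumptions (a distance-weighted KNN could be used instead).

```python
# Sketch of "generate reinforcement signals" with a Dyna-style virtual experience.
import numpy as np

def reinforcement_signal(q, buffer1):
    q_max, q_med = np.max(buffer1), np.median(buffer1)
    if np.isclose(q_max, q_med):
        return 0.5                              # degenerate buffer: neutral signal
    return float(np.clip((q - q_med) / (q_max - q_med), 0.0, 1.0))  # as in (16)

def virtual_profit(a_v, history, k=3):
    # history: list of (action, profit) pairs observed in past real rounds.
    actions = np.array([h[0] for h in history])
    profits = np.array([h[1] for h in history])
    nearest = np.argsort(np.abs(actions - a_v))[:k]
    return float(np.mean(profits[nearest]))     # unweighted KNN regression (assumed)

history = [(0.03, 120.0), (0.05, 180.0), (0.06, 175.0), (0.08, 140.0)]
buffer1 = [h[1] for h in history]               # data buffer 1 (real profits)
a_v = 0.055                                     # assumed virtual action
q_v = virtual_profit(a_v, history)
print("virtual profit:", round(q_v, 1),
      "virtual signal:", round(reinforcement_signal(q_v, buffer1), 3))
```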
Step 4: update the PDF.
Two Gaussian neighborhood functions h1(n) and h2(n) are defined at the nth iteration, centered at the real action a(n) and the virtual action av(n), respectively:
$h_1(n, a) = g_h \exp\left[-\dfrac{(a - a(n))^2}{2 g_w^2}\right]$  (19)

$h_2(n, a) = g_h \exp\left[-\dfrac{(a - a_v(n))^2}{2 g_w^2}\right]$  (20)

where $g_h$ and $g_w$ are the height and width of the update signal, respectively.
At the nth iteration, the action PDF is updated as:
$f(n+1, a) = \gamma(n)\left[f(n, a) + w\,\beta(n)\,h_1(n, a) + (1 - w)\,\beta_v(n)\,h_2(n, a)\right]$  (21)
where w is a weight factor describing the relative importance of the real and virtual experiences. w is larger in a stationary environment than in a nonstationary environment, since the usable experience keeps changing as learning progresses in the nonstationary environment. The normalization factor γ(n), which keeps the integral of the updated PDF equal to 1, can be calculated based on the composite trapezoidal rule [
An example of the “update PDF” process is shown in Fig. 3.

Fig. 3 “Update PDF” process. (a) Continuous old action PDF. (b) Continuous update signal. (c) Continuous new action PDF. (d) Discretized old action PDF. (e) Discretized update signal. (f) Discretized new action PDF.
Assume that the action space remains divided into 8 subintervals. The modification transforms the symbolic operations on the continuous PDF in Fig. 3(a)-(c) into simple numerical operations on the discretized PDF in Fig. 3(d)-(f).
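A minimal sketch of the discretized “update PDF” step is given below: the Gaussian neighborhood functions of (19) and (20), the weighted combination of (21), and renormalization with the composite trapezoidal rule are applied directly to the PDF values at the endpoints; the parameter values (w, gh, gw) are illustrative.

```python
# Sketch of "update PDF" on the discretized endpoints (illustrative parameters).
import numpy as np

def update_pdf(pdf, endpoints, a_real, beta, a_virt, beta_v,
               w=0.3, g_h=1.0, g_w=0.01):
    d = endpoints[1] - endpoints[0]
    h1 = g_h * np.exp(-((endpoints - a_real) ** 2) / (2.0 * g_w ** 2))  # (19)
    h2 = g_h * np.exp(-((endpoints - a_virt) ** 2) / (2.0 * g_w ** 2))  # (20)
    new_pdf = pdf + w * beta * h1 + (1.0 - w) * beta_v * h2             # as in (21)
    area = np.trapz(new_pdf, dx=d)      # composite trapezoidal rule
    return new_pdf / area               # renormalize so the PDF integrates to 1

endpoints = np.linspace(0.02, 0.10, 9)             # 8 subintervals, 9 endpoints
pdf = np.full(9, 1.0 / 0.08)                       # uniform initial PDF
pdf = update_pdf(pdf, endpoints, a_real=0.055, beta=0.9, a_virt=0.060, beta_v=0.4)
print("updated PDF:", np.round(pdf, 2),
      "integral:", round(float(np.trapz(pdf, dx=0.01)), 3))
```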
Simulations are run in MATLAB R2020a. The primary objective is to validate the effectiveness of the M-CARLA algorithm.
The topology of the 8-bus testing system is based on [

Fig. 4 Topology of 8-bus system.
There are 6 power suppliers in buses 1, 3, 4, 5, 6, and 7. The parameters of all power suppliers are shown in
There are 5 power consumers in buses 2, 3, 4, 6, and 8. The parameters of power consumers in 8-bus system are shown in
The DC power flow model is used, and the reactance of each transmission line is set to 0.1 p.u. The capacities of transmission lines 3, 7, and 10 are set to 100 MW to cause congestion.
To highlight the advantages of the M-CARLA algorithm, it is compared with the existing algorithms in the repeated game environment. The results of the qualitative comparison are shown in
From this comparison, it can be found that the proposed algorithm has much lower information requirements. Therefore, the proposed algorithm is more suitable for use within limited-information environments.
The Nash equilibrium calculated by analytical methods [
To provide a numerical index to evaluate its effectiveness, the accuracy A between the learning solution SL and the analytical solution SA is defined as:
$A = \left(1 - \dfrac{|S_L - S_A|}{S_A}\right) \times 100\%$  (22)
Because the action is chosen based on the action PDF, randomness is inevitable. To eliminate random factors, each simulation is run 10 times and the results are averaged in both the stationary and nonstationary environments.
The learning parameters of all suppliers are shown in
1) Stationary Environment
A stationary environment indicates that, except for the strategic supplier (the learner using the M-CARLA algorithm), the other suppliers are assumed to use fixed strategies.
Six scenarios are investigated: each supplier is chosen as the learner in turn, and when a supplier is chosen, the others fix their strategies at the Nash equilibrium. The weight factor w is set to 0.3 in these stationary environments.
The learning results of power suppliers in different scenarios in stationary environment are shown in
The accuracy of the proposed algorithm in the stationary environment falls within 95%-100%. The learning process becomes stable after 100-200 iterations. The bid curves of the different power suppliers in the stationary environment are shown in Fig. 5.

Fig. 5 Bid curves of different power suppliers in stationary environment. (a) Power supplier 1. (b) Power supplier 2. (c) Power supplier 3. (d) Power supplier 4. (e) Power supplier 5. (f) Power supplier 6.
2) Nonstationary Environment
A nonstationary environment indicates that all suppliers use the M-CARLA algorithm to bid. The weight factor w is set to 0.1 in this nonstationary environment.
The learning results of all suppliers in the nonstationary environment are shown in
The accuracy of the proposed algorithm in the nonstationary environment falls within 90.0%-97.8%. The learning process becomes stable after 200-300 iterations. The bid curves of the different power suppliers in the nonstationary environment are shown in Fig. 6.

Fig. 6 Bid curves of power suppliers in nonstationary environment. (a) Power supplier 1. (b) Power supplier 2. (c) Power supplier 3. (d) Power supplier 4. (e) Power supplier 5. (f) Power supplier 6.
The accuracy of actions and the learning efficiency in the stationary environment are higher than those in the nonstationary environment since the nonstationary environment introduces more randomness and uncertainty in the learning process. The computational complexity of this algorithm is described in Appendix B.
The parameters of the demand curves in this case study are constant. If the load fluctuates, a day can be divided into different periods with given load levels. The gaming process of the same period on different days can be considered to be a repeated game. The M-CARLA algorithm can be used in different repeated games to optimize the bidding strategy.
This paper proposes a practical bidding strategy for power suppliers with limited information. First, the gaming process of thermal power suppliers that can provide flexible ramping is modeled as a repeated game based on the supply function model. Then, an M-CARLA algorithm is proposed to enable suppliers to bid using only their own data. Finally, the proposed algorithm is tested on an 8-bus system to demonstrate its effectiveness in both stationary and nonstationary environments.
However, there are still certain limitations in this study: the virtual experience is not always reliable in a nonstationary environment, and the scalability of the proposed algorithm in a more complex and variable environment must be further validated. In future work, we plan to focus on how to use the historical experience to accelerate learning in a nonstationary environment and extend the algorithm to manage a fluctuating load profile.
Appendix
The Dyna structure combines model-free learning with a virtual model. The virtual model in the Dyna structure can generate virtual experiences to feed model-free learning. The general form of the Dyna structure is shown in Fig. A1.

Fig. A1 General form of Dyna structure.
The Dyna structure is extended from reinforcement learning and includes policy learning and model learning. During the interaction process, the structure integrates real experiences and virtual experiences. The real experiences are used for learning the policy/value, i.e., direct reinforcement learning (RL), and for learning the model concurrently. The simulated experiences produced by the model can be used to update the policy, i.e., indirect RL.
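The following toy sketch illustrates this loop for a one-dimensional action space: each round, a real experience updates both an action-value table (direct RL) and a simple model, and the model then supplies virtual experiences for additional updates (indirect RL). The tabular value estimate, the stand-in reward function, and all parameter values are illustrative assumptions and are not part of the M-CARLA algorithm itself.

```python
# Toy Dyna loop: direct RL from real experiences plus indirect RL from a model.
import numpy as np

rng = np.random.default_rng(1)
actions = np.linspace(0.0, 1.0, 11)
values = np.zeros_like(actions)     # simple action-value table (the "policy")
model = {}                          # learned model: action index -> last reward

def true_reward(a):                 # stand-in for the real environment
    return -(a - 0.7) ** 2 + rng.normal(0.0, 0.01)

for n in range(200):
    idx = int(np.argmax(values)) if rng.random() > 0.1 else int(rng.integers(len(actions)))
    r = true_reward(actions[idx])
    values[idx] += 0.1 * (r - values[idx])          # direct RL update
    model[idx] = r                                  # model learning
    for _ in range(5):                              # planning with virtual experiences
        k = rng.choice(list(model))
        values[k] += 0.1 * (model[k] - values[k])   # indirect RL update

print("best learned action:", actions[int(np.argmax(values))])
```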
The number of subintervals markedly affects the computational complexity of the algorithm. Therefore, the computational complexity is analyzed from the perspective of a single supplier with different subinterval magnitudes, as shown in Table BI. The time taken to update the action PDF and select an action in each round is recorded. To eliminate interference from other power suppliers, the environment is assumed to be stationary. All simulations are run on a computer with an Intel Core i
The computational complexity increases linearly as the magnitude of the subintervals increases.
References
M. Mallaki, M. S. Naderi, M. Abedi et al., “Strategic bidding in distribution network electricity market focusing on competition modeling and uncertainties,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 3, pp. 561-572, May 2021.
B. F. Hobbs, C. B. Metzler, and J. Pang, “Strategic gaming analysis for electric power systems: an MPEC approach,” IEEE Transactions on Power Systems, vol. 15, no. 2, pp. 638-645, May 2000.
Q. Jia, Y. Li, Z. Yan et al., “Reactive power market design for distribution networks with high photovoltaic penetration,” IEEE Transactions on Smart Grid, doi: 10.1109/TSG.2022.3186338.
M. Rayati, A. Sheikhi, A. M. Ranjbar et al., “Optimal equilibrium selection of price-maker agents in performance-based regulation market,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 1, pp. 204-212, Jan. 2022.
C. Huang, H. Zhang, L. Wang et al., “Mixed deep reinforcement learning considering discrete-continuous hybrid action space for smart home energy management,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 3, pp. 743-754, May 2022.
H. M. Schwartz, Multi-agent Machine Learning: A Reinforcement Learning Approach. New Jersey: John Wiley & Sons, 2014.
Y. Zhou, W.-J. Lee, R. Diao et al., “Deep reinforcement learning based real-time ac optimal power flow considering uncertainties,” Journal of Modern Power Systems and Clean Energy, doi: 10.35833/MPCE.2020.000885.
S. Wu, W. Hu, Z. Lu et al., “Power system flow adjustment and sample generation based on deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1115-1127, Nov. 2020.
D. Cao, W. Hu, X. Xu et al., “Deep reinforcement learning based approach for optimal power flow of distribution networks embedded with renewable energy and storage devices,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 5, pp. 1101-1110, Sept. 2021.
N. Yu, C. C. Liu, and L. Tesfatsion, “Modeling of suppliers’ learning behaviors in an electricity market environment,” in Proceedings of 2007 International Conference on Intelligent Systems Applications to Power Systems, Toki Messe, Niigata, Nov. 2007, pp. 1-6.
N. Rashedi, M. A. Tajeddini, and H. Kebriaei, “Markov game approach for multi-agent competitive bidding strategies in electricity market,” IET Generation, Transmission & Distribution, vol. 10, no. 15, pp. 3756-3763, Nov. 2016.
R. Ragupathi and T. K. Das, “A stochastic game approach for modeling wholesale energy bidding in deregulated power markets,” IEEE Transactions on Power Systems, vol. 19, no. 2, pp. 849-856, May 2004.
Y. Ye, D. Qiu, M. Sun et al., “Deep reinforcement learning for strategic bidding in electricity markets,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1343-1355, Mar. 2020.
H. Xu, H. Sun, D. Nikovski et al., “Deep reinforcement learning for joint bidding and pricing of load serving entity,” IEEE Transactions on Smart Grid, vol. 10, no. 6, pp. 6366-6375, Nov. 2019.
D. Cao, W. Hu, and X. Xu, “Bidding strategy for trading wind energy and purchasing reserve of wind power producer–a DRL based approach,” International Journal of Electrical Power & Energy Systems, vol. 117, pp. 1-10, May 2020.
H. K. Nunna, A. Sesetti, and A. K. Rathore, “Multiagent-based energy trading platform for energy storage systems in distribution systems with interconnected microgrids,” IEEE Transactions on Industry Applications, vol. 56, no. 3, pp. 3207-3217, May 2020.
V. Hakami and M. Dehghan, “Learning stationary correlated equilibria in constrained general-sum stochastic games,” IEEE Transactions on Cybernetics, vol. 46, no. 7, pp. 1640-1654, Jul. 2016.
L. Li, C. Langbort, and J. Shamma, “An LP approach for solving two-player zero-sum repeated Bayesian games,” IEEE Transactions on Automatic Control, vol. 64, no. 9, pp. 3716-3731, Sept. 2019.
K. Hwang, W. Jiang, Y. Chen et al., “Model-based indirect learning method based on Dyna-Q architecture,” in Proceedings of 2013 IEEE International Conference on Systems, Man, and Cybernetics, Manchester, UK, Oct. 2013, pp. 2540-2544.
K. Dehghanpour, M. H. Nehrir, J. W. Sheppard et al., “Agent-based modeling in electrical energy markets using dynamic Bayesian networks,” IEEE Transactions on Power Systems, vol. 31, no. 6, pp. 4744-4754, Nov. 2016.
L. B. Cunningham, R. Baldick, and M. L. Baughman, “An empirical study of applied game theory: transmission constrained Cournot behavior,” IEEE Transactions on Power Systems, vol. 17, no. 1, pp. 166-172, Feb. 2002.
Y. Wang, “The calculation of nodal price based on optimal power flow,” M.S. thesis, Department of Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China, 2014.
M. N. Howell, G. P. Frost, T. J. Gordon et al., “Continuous action reinforcement learning applied to vehicle suspension control,” Mechatronics, vol. 7, no. 3, pp. 263-276, Apr. 1997.
X. Liu and N. Mao, “New continuous action-set learning automata,” Journal of Data Acquisition and Processing, vol. 30, no. 6, pp. 1310-1317, Nov. 2015.
Q. I. Rahman and G. Schmeisser, “Characterization of the speed of convergence of the trapezoidal rule,” Numerische Mathematik, vol. 57, no. 1, pp. 123-138, Dec. 1990.
T. Tao, Learning Automata and Its Application in Stochastic Point Location Problem. Shanghai, China: Shanghai Jiao Tong University, 2007.
K. S. Narendra and M. A. Thathachar, Learning Automata: An Introduction. New York: Dover Publications, 1989.
H. Shi, “A sample aggregation approach to experiences replay of Dyna-Q learning,” IEEE Access, vol. 6, pp. 37173-37184, Apr. 2018.
J. Song, J. Zhao, F. Dong et al., “A novel regression modeling method for PMSLM structural design optimization using a distance-weighted KNN algorithm,” IEEE Transactions on Industry Applications, vol. 54, no. 5, pp. 4198-4206, Sept.-Oct. 2018.
D. Cruz-Uribe and C. J. Neugebauer, “Sharp error bounds for the trapezoidal rule and Simpson’s rule,” Journal of Inequalities in Pure and Applied Mathematics, vol. 3, no. 4, pp. 1-22, Apr. 2002.
T. Li and M. Shahidehpour, “Strategic bidding of transmission-constrained GENCOs with incomplete information,” IEEE Transactions on Power Systems, vol. 20, no. 1, pp. 437-447, Feb. 2005.
F. Wen and A. K. David, “Optimal bidding strategies and modeling of imperfect information among competitive generators,” IEEE Transactions on Power Systems, vol. 16, no. 1, pp. 15-21, Feb. 2001.
R. W. Ferrero, J. F. Rivera, and S. M. Shahidehpour, “Application of games with incomplete information for pricing electricity in deregulated power pools,” IEEE Transactions on Power Systems, vol. 13, no. 1, pp. 184-189, Feb. 1998.