A Comparison of Contextual Optimization and Reinforcement Learning on Battery Arbitrage in Electricity Markets
Series: Research Master Defense
Speaker: Özgün Baris Kir
Location: Tinbergen Institute Amsterdam, room 1.02
Date and time: July 14, 2025, 11:00 - 12:00
Battery energy-storage systems arbitrage electricity by charging when prices are low and discharging when they rise. Classical supervisory schemes solve, each hour, a 24-step linear program driven by a price forecast (predict-then-optimize, PTO) or train that forecaster with a task-aware loss (decision-focused learning, DFL). Both approaches break the multi-day coupling of the state of charge by attaching an ex-post salvage price to the residual energy at the end of every horizon. This thesis advances reinforcement learning (RL) as a sequential alternative that carries the state of charge forward and prices future flexibility explicitly through the Bellman recursion. Three PTO forecasters (ridge regression, histogram gradient boosting, and a temporal convolutional network) are benchmarked against two DFL variants with differentiable linear-programming layers and four RL agents: a LinUCB contextual bandit, tabular Q-learning on a discretised Markov decision process, a deep Q-network, and continuous-action Soft Actor-Critic (SAC). To reduce the curse of dimensionality that burdens tabular methods, we introduce a novel predict-and-control scheme: exogenous drivers (price, wind, solar, temperature) are first forecast, and the battery then solves a five-state MDP over its state of charge; profit-based gradients can calibrate the forecaster exactly as in DFL.
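The hourly PTO step can be made concrete with a minimal sketch, assuming a convex-optimization formulation in Python with cvxpy: a 24-step LP that schedules charging and discharging against a forecast price vector and attaches a salvage price to the residual state of charge. The capacity, efficiency, and salvage parameters here are illustrative assumptions, not values taken from the thesis.

import cvxpy as cp
import numpy as np

H = 24                       # hours in one optimization horizon
P_MAX = 10.0                 # charge/discharge power limit [MW], matching the 10 MW battery
E_MAX = 20.0                 # assumed usable energy capacity [MWh]
ETA_C, ETA_D = 0.95, 0.95    # assumed charge/discharge efficiencies
SOC_0 = 0.5 * E_MAX          # assumed initial state of charge [MWh]

def solve_horizon(price_forecast: np.ndarray, salvage_price: float) -> np.ndarray:
    """Solve one 24-step arbitrage LP; return the net discharge schedule in MW."""
    charge = cp.Variable(H, nonneg=True)      # MW bought from the market
    discharge = cp.Variable(H, nonneg=True)   # MW sold to the market
    soc = cp.Variable(H + 1)                  # stored energy [MWh]

    constraints = [soc[0] == SOC_0, soc >= 0, soc <= E_MAX,
                   charge <= P_MAX, discharge <= P_MAX]
    for t in range(H):
        # Storage dynamics with (assumed) charge/discharge losses.
        constraints.append(soc[t + 1] == soc[t] + ETA_C * charge[t] - discharge[t] / ETA_D)

    # Day-ahead revenue plus the ex-post salvage value of leftover energy,
    # which is how PTO and DFL break the multi-day state-of-charge coupling.
    revenue = price_forecast @ (discharge - charge) + salvage_price * soc[H]
    cp.Problem(cp.Maximize(revenue), constraints).solve()
    return discharge.value - charge.value

In a rolling setting the same LP would be re-solved each hour with the realized state of charge and an updated forecast, while a DFL variant would differentiate through this program to train the forecaster on realized profit rather than forecast error.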
Experiments simulate a 10 MW battery on Dutch day-ahead electricity-market prices. A six-month case study from January to July 2019 is considered, and each algorithm is tested with two feature sets (price lags only versus price-plus-weather lags) and under two operational modes: hourly rolling re-optimization (Branch A) and day-ahead commitment (Branch B). Across all settings the continuous-action Soft Actor-Critic agent delivers the highest mean daily profit with the lowest downside volatility, outperforming the best PTO baseline by 30-40% and the strongest DFL model by an order of magnitude. Predict-and-control attains roughly one eighth of SAC's revenue while shrinking the Markov grid from 400 to 5 states and cutting training time from hours to seconds, illustrating a practical trade-off between solution quality and computational effort. Adding meteorological covariates improves forecast RMSE yet raises arbitrage profit by less than 10%, indicating that short-horizon price history already captures most operationally relevant weather information.
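To illustrate the predict-and-control idea, here is a minimal sketch under an assumed reading of the method: given one day's forecast prices from the exogenous forecaster, the battery solves a finite-horizon MDP on a five-level state-of-charge grid by backward induction over the Bellman recursion. The grid spacing, action set, lossless dynamics, and zero terminal value are illustrative assumptions rather than the thesis's exact specification.

import numpy as np

N_SOC = 5              # coarse state-of-charge levels (empty ... full)
STEP_MWH = 5.0         # assumed energy moved per level change [MWh]
ACTIONS = (-1, 0, +1)  # discharge one level, hold, charge one level

def plan_day(price_forecast: np.ndarray, soc0: int = 2) -> list[int]:
    """Return a profit-maximising sequence of SoC levels for one forecast day."""
    H = len(price_forecast)
    V = np.zeros((H + 1, N_SOC))               # value-to-go; terminal value assumed zero
    policy = np.zeros((H, N_SOC), dtype=int)

    # Bellman recursion, backwards in time over the forecast horizon.
    for t in range(H - 1, -1, -1):
        for s in range(N_SOC):
            best_val, best_a = -np.inf, 0
            for a in ACTIONS:
                nxt = s + a
                if not 0 <= nxt < N_SOC:
                    continue
                # Charging (a > 0) buys energy at the forecast price; discharging sells it.
                reward = -price_forecast[t] * a * STEP_MWH
                val = reward + V[t + 1, nxt]
                if val > best_val:
                    best_val, best_a = val, a
            V[t, s], policy[t, s] = best_val, best_a

    # Roll the greedy policy forward from the initial state of charge.
    soc, path = soc0, [soc0]
    for t in range(H):
        soc = soc + int(policy[t, soc])
        path.append(soc)
    return path

With only five states and three actions the recursion runs in a fraction of a second per day, consistent with the hours-to-seconds training-time reduction reported above; the forecaster supplying price_forecast could, as noted, be calibrated with profit-based gradients in the DFL style.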
The results argue, first, that value-based RL
should be a front-line candidate for merchant battery control whenever
extensive historical market data are available; second, that hybrid predict-and-control
architectures show promise in terms of efficiency but require further study of
their profitability; and third, that future sequential DFL research must tame
the overfitting behavior observed here if it is to close the remaining
performance gap to reinforcement learning.
List of Abbreviations
BESS Battery Energy Storage System
CNN Convolutional Neural Network
DFL Decision-Focused Learning
LP Linear Program / Linear Programming
MDP Markov Decision Process
PTO Predict-Then-Optimize
RL Reinforcement Learning
SAC Soft Actor–Critic
SoC State of Charge