A Comparison of Contextual Optimization and Reinforcement Learning on Battery Arbitrage in Electricity Markets
Series: Research Master Defense
Speaker: Özgün Baris Kir
Location: Tinbergen Institute Amsterdam, room 1.02
Date and time: July 14, 2025, 11:00 - 12:00
Battery energy-storage systems arbitrage electricity by charging when prices are low and discharging when they rise. Classical supervisory schemes solve, each hour, a 24-step linear program driven by a price forecast (predict-then-optimize, PTO) or train that forecaster with a task-aware loss (decision-focused learning, DFL). Both approaches break the multi-day coupling of the state of charge by attaching an ex-post salvage price to the residual energy at the end of every horizon. This thesis advances reinforcement learning (RL) as a sequential alternative that carries the state of charge forward and prices future flexibility explicitly through the Bellman recursion. Three PTO forecasters (ridge regression, histogram gradient boosting, and a temporal convolutional network) are benchmarked against two DFL variants with differentiable linear-programming layers and four RL agents: a LinUCB contextual bandit, tabular Q-learning on a discretised Markov decision process, a deep Q-network, and continuous-action Soft Actor-Critic (SAC). To reduce the curse of dimensionality that burdens tabular methods, we introduce a novel predict-and-control scheme: exogenous drivers (price, wind, solar, temperature) are first forecast, and the battery then solves a five-state MDP over its state of charge; profit-based gradients can calibrate the forecaster exactly as in DFL.
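The hourly PTO step can be made concrete with a minimal sketch, assuming a convex-optimization formulation in Python with cvxpy: a 24-step LP that schedules charging and discharging against a forecast price vector and attaches a salvage price to the residual state of charge. The capacity, efficiency, and salvage parameters here are illustrative assumptions, not values taken from the thesis.

import cvxpy as cp
import numpy as np

H = 24                       # hours in one optimization horizon
P_MAX = 10.0                 # charge/discharge power limit [MW], matching the 10 MW battery
E_MAX = 20.0                 # assumed usable energy capacity [MWh]
ETA_C, ETA_D = 0.95, 0.95    # assumed charge/discharge efficiencies
SOC_0 = 0.5 * E_MAX          # assumed initial state of charge [MWh]

def solve_horizon(price_forecast: np.ndarray, salvage_price: float) -> np.ndarray:
    """Solve one 24-step arbitrage LP; return the net discharge schedule in MW."""
    charge = cp.Variable(H, nonneg=True)      # MW bought from the market
    discharge = cp.Variable(H, nonneg=True)   # MW sold to the market
    soc = cp.Variable(H + 1)                  # stored energy [MWh]

    constraints = [soc[0] == SOC_0, soc >= 0, soc <= E_MAX,
                   charge <= P_MAX, discharge <= P_MAX]
    for t in range(H):
        # Storage dynamics with (assumed) charge/discharge losses.
        constraints.append(soc[t + 1] == soc[t] + ETA_C * charge[t] - discharge[t] / ETA_D)

    # Day-ahead revenue plus the ex-post salvage value of leftover energy,
    # which is how PTO and DFL break the multi-day state-of-charge coupling.
    revenue = price_forecast @ (discharge - charge) + salvage_price * soc[H]
    cp.Problem(cp.Maximize(revenue), constraints).solve()
    return discharge.value - charge.value

In a rolling setting the same LP would be re-solved each hour with the realized state of charge and an updated forecast, while a DFL variant would differentiate through this program to train the forecaster on realized profit rather than forecast error.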
Experiments simulate a 10 MW battery on Dutch day-ahead electricity-market prices. A six-month case study from January to July 2019 is considered, and each algorithm is tested with two feature sets (price lags only versus price-plus-weather lags) and under two operational modes: hourly rolling re-optimization (Branch A) and day-ahead commitment (Branch B). Across all settings the continuous-action Soft Actor-Critic agent delivers the highest mean daily profit with the lowest downside volatility, outperforming the best PTO baseline by 30-40% and the strongest DFL model by an order of magnitude. Predict-and-control attains roughly one eighth of SAC's revenue while shrinking the Markov grid from 400 to 5 states and cutting training time from hours to seconds, illustrating a practical trade-off between solution quality and computational effort. Adding meteorological covariates improves forecast RMSE yet raises arbitrage profit by less than 10%, indicating that short-horizon price history already captures most operationally relevant weather information.
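To illustrate the predict-and-control idea, here is a minimal sketch under an assumed reading of the method: given one day's forecast prices from the exogenous forecaster, the battery solves a finite-horizon MDP on a five-level state-of-charge grid by backward induction over the Bellman recursion. The grid spacing, action set, lossless dynamics, and zero terminal value are illustrative assumptions rather than the thesis's exact specification.

import numpy as np

N_SOC = 5              # coarse state-of-charge levels (empty ... full)
STEP_MWH = 5.0         # assumed energy moved per level change [MWh]
ACTIONS = (-1, 0, +1)  # discharge one level, hold, charge one level

def plan_day(price_forecast: np.ndarray, soc0: int = 2) -> list[int]:
    """Return a profit-maximising sequence of SoC levels for one forecast day."""
    H = len(price_forecast)
    V = np.zeros((H + 1, N_SOC))               # value-to-go; terminal value assumed zero
    policy = np.zeros((H, N_SOC), dtype=int)

    # Bellman recursion, backwards in time over the forecast horizon.
    for t in range(H - 1, -1, -1):
        for s in range(N_SOC):
            best_val, best_a = -np.inf, 0
            for a in ACTIONS:
                nxt = s + a
                if not 0 <= nxt < N_SOC:
                    continue
                # Charging (a > 0) buys energy at the forecast price; discharging sells it.
                reward = -price_forecast[t] * a * STEP_MWH
                val = reward + V[t + 1, nxt]
                if val > best_val:
                    best_val, best_a = val, a
            V[t, s], policy[t, s] = best_val, best_a

    # Roll the greedy policy forward from the initial state of charge.
    soc, path = soc0, [soc0]
    for t in range(H):
        soc = soc + int(policy[t, soc])
        path.append(soc)
    return path

With only five states and three actions the recursion runs in a fraction of a second per day, consistent with the hours-to-seconds training-time reduction reported above; the forecaster supplying price_forecast could, as noted, be calibrated with profit-based gradients in the DFL style.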
The results argue, first, that value-based RL
should be a front-line candidate for merchant battery control whenever
extensive historical market data are available; second, that hybrid predict-and-control
architectures show promise in terms of efficiency but require further study of
their profitability; and third, that future sequential DFL research must tame
the overfitting behavior observed here if it is to close the remaining
performance gap to reinforcement learning.
List of Abbreviations
BESS Battery Energy Storage System
CNN Convolutional Neural Network
DFL Decision-Focused Learning
LP Linear Program / Linear Programming
MDP Markov Decision Process
PTO Predict-Then-Optimize
RL Reinforcement Learning
SAC Soft Actor–Critic
SoC State of Charge