Research Master Pre-Defense

A Comparison of Contextual Optimization and Reinforcement Learning on Battery Arbitrage in Electricity Markets


  • Series
    Research Master Defense
  • Speakers
    Özgün Baris Kir
  • Location
    Tinbergen Institute Amsterdam, room 1.02
    Amsterdam
  • Date and time

    July 14, 2025
    11:00 - 12:00

Battery energy storage systems (BESS) arbitrage electricity by charging when prices are low and discharging when they rise. Classical supervisory schemes solve, each hour, a 24-step linear program driven by a price forecast (predict-then-optimize, PTO) or train that forecaster with a task-aware loss (decision-focused learning, DFL). Both approaches break the multi-day coupling of the state of charge by attaching an ex-post salvage price to the residual energy at the end of every horizon. This thesis advances reinforcement learning (RL) as a sequential alternative that carries the state of charge forward and prices future flexibility explicitly through the Bellman recursion. Three PTO forecasters (ridge regression, histogram gradient boosting, and a temporal convolutional network) are benchmarked against two DFL variants with differentiable linear-programming layers and four RL agents: a LinUCB contextual bandit, tabular Q-learning on a discretised Markov decision process, a deep Q-network, and continuous-action Soft Actor-Critic (SAC). To reduce the curse of dimensionality that burdens tabular methods, we introduce a novel predict-and-control scheme: exogenous drivers (price, wind, solar, temperature) are first forecast, and the battery then solves a five-state MDP over its state of charge; profit-based gradients can calibrate the forecaster exactly as in DFL.
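As a rough illustration of the predict-then-optimize step described above, the sketch below solves one 24-hour battery-arbitrage linear program with SciPy. The battery size, one-way efficiencies, mean-price salvage value, and the function name arbitrage_lp are illustrative assumptions, not the thesis configuration.

```python
import numpy as np
from scipy.optimize import linprog

def arbitrage_lp(price_forecast, soc0=5.0, cap=10.0, p_max=10.0,
                 eta_c=0.95, eta_d=0.95, salvage=None):
    """One 24-step predict-then-optimize LP for a battery arbitrage schedule.

    Decision vector x = [charge_0..charge_23, discharge_0..discharge_23] in MW
    (hourly steps, so MW == MWh). Residual energy at the end of the horizon is
    valued at an ex-post `salvage` price, which breaks the multi-day SoC coupling.
    All parameter values are illustrative assumptions.
    """
    T = len(price_forecast)
    p = np.asarray(price_forecast, dtype=float)
    if salvage is None:
        salvage = p.mean()                          # simple salvage-price heuristic

    # Objective (linprog minimizes): -profit - salvage * SoC_T, constants dropped.
    cost = np.concatenate([p - salvage * eta_c,     # charging coefficients
                           -p + salvage / eta_d])   # discharging coefficients

    # SoC_t = soc0 + eta_c * cumsum(charge) - (1/eta_d) * cumsum(discharge)
    L = np.tril(np.ones((T, T)))                    # lower-triangular cumulative-sum matrix
    A_soc = np.hstack([eta_c * L, -(1.0 / eta_d) * L])
    A_ub = np.vstack([A_soc, -A_soc])               # enforce SoC_t <= cap and SoC_t >= 0
    b_ub = np.concatenate([np.full(T, cap - soc0), np.full(T, soc0)])

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0.0, p_max)] * (2 * T), method="highs")
    charge, discharge = res.x[:T], res.x[T:]
    return charge, discharge, float(p @ (discharge - charge))

# Example: a synthetic daily price curve with cheap overnight and expensive midday hours.
prices = 40 + 25 * np.sin(np.linspace(0, 2 * np.pi, 24) - np.pi / 2)
c, d, profit = arbitrage_lp(prices)
print(f"Forecast-based daily arbitrage profit: {profit:.1f} EUR")
```

In a rolling PTO scheme this LP would be re-solved every hour with a refreshed price forecast; a DFL variant would instead backpropagate realised profit through a differentiable version of the same program to retrain the forecaster.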

Experiments simulate a 10 MW battery on Dutch day-ahead electricity market prices. A six-month case study between January and July 2019 is considered, and each algorithm is tested with two feature sets (price lags only versus price-plus-weather lags) and under two operational modes: hourly rolling re-optimization (Branch A) and day-ahead commitment (Branch B). Across all settings the continuous Soft Actor-Critic agent delivers the highest mean daily profit with the lowest downside volatility, outperforming the best PTO baseline by 30-40% and the strongest DFL model by an order of magnitude. Predict-and-control attains roughly one eighth of SAC's revenue while shrinking the Markov grid from 400 to 5 states and cutting training time from hours to seconds, illustrating a practical trade-off between solution quality and computational effort. Adding meteorological covariates improves forecast RMSE yet raises arbitrage profit by less than 10%, indicating that short-horizon price history already captures most operationally relevant weather information.
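To give a feel for the predict-and-control simplification behind the 400-to-5-state reduction mentioned above, the sketch below runs tabular Q-learning on a coarse state-of-charge grid against pre-generated daily price paths. The five-state grid follows the abstract, but the three-action set, reward shape, hyperparameters, and the name q_learning_soc are assumptions for illustration only.

```python
import numpy as np

def q_learning_soc(price_paths, n_soc=5, cap=10.0,
                   alpha=0.1, gamma=0.99, eps=0.1, episodes=2000, seed=0):
    """Tabular Q-learning over a coarse SoC grid: exogenous drivers come from a
    forecaster (here, pre-generated daily price paths) and the agent only
    controls the battery's state of charge. Illustrative settings only.
    """
    rng = np.random.default_rng(seed)
    step = cap / (n_soc - 1)                 # MWh moved per charge/discharge step
    n_act = 3                                # 0 = discharge, 1 = hold, 2 = charge
    Q = np.zeros((n_soc, n_act))

    for _ in range(episodes):
        prices = price_paths[rng.integers(len(price_paths))]   # one forecast day
        s = n_soc // 2                                          # start half full
        for t, price in enumerate(prices):
            a = rng.integers(n_act) if rng.random() < eps else int(Q[s].argmax())
            s_next = int(np.clip(s + (a - 1), 0, n_soc - 1))
            energy = (s_next - s) * step      # realised move after hitting SoC bounds
            reward = -price * energy          # pay to charge, earn when discharging
            not_last = t < len(prices) - 1
            Q[s, a] += alpha * (reward + gamma * not_last * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q

# Usage: train on synthetic forecast days, then act greedily on new prices.
rng = np.random.default_rng(1)
days = 40 + 25 * np.sin(np.linspace(0, 2 * np.pi, 24) - np.pi / 2) + rng.normal(0, 3, (100, 24))
Q = q_learning_soc(days)
```

Because the table has only 5 x 3 entries, training completes in seconds, which is the efficiency side of the quality-versus-effort trade-off reported above.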

The results argue, first, that value-based RL should be a front-line candidate for merchant battery control whenever extensive historical market data are available; second, that hybrid predict-and-control architectures offer clear efficiency gains but deserve further study to establish their profitability; and third, that future sequential DFL research must tame the overfitting behavior observed here if it is to close the remaining performance gap to reinforcement learning.


List of Abbreviations


BESS Battery Energy Storage System

CNN Convolutional Neural Network

DFL Decision-Focused Learning

LP Linear Program / Linear Programming

MDP Markov Decision Process

PTO Predict-Then-Optimize

RL Reinforcement Learning

SAC Soft Actor-Critic

SoC State of Charge