Habr AI showed that reinforcement learning still trails classical optimization in logistics
AI-processed from Habr AI; edited by Hamidun News
Habr AI published a detailed experiment on whether reinforcement learning (RL) can replace classical mathematical optimization in applied logistics. The result was sobering: RL can already solve a structured task, but its solution quality still falls short of a solver's.
How the task was set
At the center of the experiment was a down-to-earth business problem: how to plan refueling stops for cargo vehicles along a route so as to reduce fuel costs. For carriers this is a sensitive cost item, and the price variation between gas stations leaves real room for optimization. Simply choosing the cheapest stations is not enough: the plan has to respect route constraints, tank capacity, and operational requirements. The author chose this case because it is closer to real logistics than textbook problems like TSP, and because it clearly shows the boundary between academic RL and applied optimization.
The plan had to satisfy four constraints:
- The fuel reserve must not drop below a minimum threshold on any segment
- The fuel in the tank must not exceed the tank's maximum capacity
- A specified fuel reserve must remain at the end of the route
- A stop at a gas station is justified only if at least a minimum volume is refueled
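The four constraints above can be sketched as a simple feasibility check. This is a minimal illustration, not the author's code: the function name `check_plan`, its parameters, and the assumption that station `i` sits at the start of segment `i` are all hypothetical.

```python
def check_plan(start_fuel, consumption, refuels, tank_capacity,
               min_reserve, final_reserve, min_refuel):
    """Return a list of violated constraints (empty list = feasible plan).

    Assumed layout (hypothetical): station i sits at the start of segment i.
    consumption[i] is the fuel burned on segment i,
    refuels[i] is the volume bought at station i (0 means no stop).
    """
    eps = 1e-9  # tolerance for floating-point comparisons
    violations = []
    fuel = start_fuel
    for i, (buy, burn) in enumerate(zip(refuels, consumption)):
        # A stop only makes sense with a minimum refueling volume.
        if 0 < buy < min_refuel:
            violations.append(f"station {i}: refuel below minimum volume")
        fuel += buy
        # The tank cannot hold more than its capacity.
        if fuel > tank_capacity + eps:
            violations.append(f"station {i}: tank overfilled")
        fuel -= burn
        # The reserve must hold on every segment.
        if fuel < min_reserve - eps:
            violations.append(f"segment {i}: reserve below threshold")
    # A specified reserve must remain at the finish.
    if fuel < final_reserve - eps:
        violations.append("finish: final reserve not met")
    return violations


feasible = check_plan(100, [80, 80], [0, 100], 200, 20, 30, 40)
infeasible = check_plan(100, [80, 80], [0, 10], 200, 20, 30, 40)
```

Here `feasible` is an empty list, while `infeasible` reports three violations: the 10-unit purchase is below the minimum volume, and the vehicle then runs below both the segment reserve and the final reserve.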
To adapt the task for RL, the refueling volume had to be discretized. Instead of a continuous choice, the agent was given five actions: refuel 0%, 25%, 50%, 75%, or 100% of the free tank space. In parallel, the same problem was formulated as a nonlinear programming task and solved with the classical SCIP solver. This created a clear baseline: instead of guessing whether the agent is learning, you can compare it against a practically optimal solution in the same formulation.
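The discretization can be sketched as a mapping from an action index to a refueling volume. The five fractions come from the article; the names `REFUEL_FRACTIONS` and `action_to_volume` are hypothetical and only illustrate the idea.

```python
# The five discrete actions: share of the free tank space to fill.
REFUEL_FRACTIONS = (0.0, 0.25, 0.5, 0.75, 1.0)


def action_to_volume(action, fuel_level, tank_capacity):
    """Convert an action index (0..4) into a refueling volume."""
    free_space = max(tank_capacity - fuel_level, 0.0)
    return REFUEL_FRACTIONS[action] * free_space


# With 100 units in a 300-unit tank, action 2 (50%) fills half of the
# 200 free units.
volume = action_to_volume(2, fuel_level=100.0, tank_capacity=300.0)
```

Because every action is a fraction of the free space, the tank capacity constraint holds by construction; the price is that the agent cannot pick volumes between the five levels, while the solver works with a continuous variable.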
How the agent was trained
For the experiment, the author built a custom RL environment, since no ready-made sandboxes exist for this kind of task. The agent's state was described by a vector containing future fuel consumption between gas stations, fuel prices, and tank constraints. Because route lengths vary, the vector was brought to a fixed size: the data was padded with zeros and then normalized so that differing scales would not confuse the model. As a result, at each step the agent saw the current fuel level, the future fuel need, the available prices, and the required reserve at the finish.
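A fixed-size state of this kind could be assembled roughly as follows. The article does not say how normalization was done, so dividing by tank capacity and by a maximum price is an assumption, as are the function name `build_state`, its parameters, and the ordering of the vector.

```python
def build_state(fuel_level, consumption_ahead, prices_ahead,
                tank_capacity, final_reserve, max_stations, max_price):
    """Build a fixed-length, normalized state vector (illustrative only)."""

    def scale_and_pad(values, scale):
        # Keep at most max_stations entries, scale them, pad with zeros.
        scaled = [v / scale for v in values[:max_stations]]
        return scaled + [0.0] * (max_stations - len(scaled))

    return (
        # Fuel quantities are expressed as a share of tank capacity (assumption).
        [fuel_level / tank_capacity, final_reserve / tank_capacity]
        + scale_and_pad(consumption_ahead, tank_capacity)
        # Prices are expressed as a share of an assumed price ceiling.
        + scale_and_pad(prices_ahead, max_price)
    )


# A route with two remaining stations, padded to a four-station layout.
state = build_state(150.0, [60.0, 90.0], [50.0, 40.0],
                    tank_capacity=300.0, final_reserve=30.0,
                    max_stations=4, max_price=100.0)
```

The resulting vector always has `2 + 2 * max_stations` entries, so routes of different lengths can go through the same network input layer.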
The reward was built around refueling cost, with penalties added for constraint violations. For the algorithm, the author chose a combination of Dueling DQN and Double DQN: the first separates state value from action advantage, and the second reduces Q-value overestimation and makes learning more stable. The author tested two network architectures, fully connected and one-dimensional convolutional, and also added a replay buffer, decaying exploration, and curriculum learning with expert episodes, in which the optimal strategy was partially suggested by the classical model.
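The two ideas can be shown without a neural network, on plain lists of numbers. This sketch uses the commonly used mean-subtracted dueling aggregation and the standard Double DQN target; the article does not specify which variants or hyperparameters were used, so the function names and the `gamma` values here are assumptions.

```python
def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN target for a single transition.

    The online network picks the next action, the target network
    evaluates it; decoupling the two reduces overestimation.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


q_values = dueling_q_values(1.0, [0.0, 2.0, 4.0])
target = double_dqn_target(-1.0, False,
                           next_q_online=[0.1, 0.9, 0.3],
                           next_q_target=[5.0, 2.0, 7.0],
                           gamma=0.5)
```

In the example the online estimates select action 1, so the target uses the target network's value 2.0 for that action rather than its own maximum of 7.0. A plain DQN target would have used 7.0, which is the overestimation Double DQN is meant to dampen.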
What the test showed
The real data brought a typical business problem: the history was short, records were duplicated, and log collection had not been set up with training in mind. Training was therefore moved to a synthetic dataset tuned to the variation of real routes. In the training plots, both neural network architectures quickly converged to roughly the same average reward of about -7. Neither a longer exploration phase, nor added expert actions, nor reward retuning gave a noticeable improvement. In other words, the agent stabilized but did not start making noticeably stronger decisions.
The most interesting part came from the comparison with mathematical optimization on 86 real routes. In total, the RL models spent more and bought more fuel than the solver baseline, with a cost gap of 8% to 54% depending on the training variant. The Overload modification, which penalized excess fuel at the end of the route more heavily, came closest to the optimum.
At the same time, RL had an unexpected advantage: its average fuel purchase price was lower. The problem is that the agent offset this by buying excess fuel and did not try to finish the route with a reserve close to the required one. It handled the minimum reserve constraints reasonably well, and RL inference was faster than the solver, but once about an hour of training is taken into account, the classical approach keeps its advantage.
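A lower average price and a higher total bill are compatible, as a small arithmetic sketch shows. The numbers below are invented purely for illustration and are not taken from the experiment; the `summarize` helper is hypothetical.

```python
def summarize(purchases):
    """purchases: list of (volume, price_per_unit) tuples.

    Returns total volume, total cost, and average price per unit.
    """
    volume = sum(v for v, _ in purchases)
    cost = sum(v * p for v, p in purchases)
    return volume, cost, cost / volume


# Invented numbers, for illustration only.
solver_plan = [(100.0, 60.0), (100.0, 50.0)]  # buys exactly what the route needs
agent_plan = [(50.0, 60.0), (250.0, 50.0)]    # overbuys at the cheaper station

solver_volume, solver_cost, solver_avg = summarize(solver_plan)
agent_volume, agent_cost, agent_avg = summarize(agent_plan)
```

In this toy case the agent's average price is lower (about 51.7 against 55.0), yet it pays 15,500 against 11,000 because it buys 300 units instead of 200. The extra fuel simply stays in the tank at the finish.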
What this means
Habr AI's experiment does not rule out RL for optimization, but it does put it in its place. For well-formalized logistics tasks, classical mathematical programming is still more reliable, cheaper in labor costs, and more accurate in its results. The real prospects for RL lie more in hybrid scenarios: as an accelerator, as a generator of initial solutions, or as an adaptation layer where the environment is too dynamic for a fixed model.