Gazprom Neft: predictive maintenance cut drilling downtime by 41%

Total downtime across the 1,240-rig fleet dropped 41%. In absolute terms, that is 188,000 rig-operating-hours recovered per year. At 7.7M rubles of average daily revenue per rig, this amounts to about 60 billion rubles of additional production per year, the largest ROI of any ML project in the Russian oil & gas sector.

-41% downtime
₽60B extra production/yr
68→9 emergency failures/yr
200+ h early warning

Background

Gazprom Neft operates 1,240 drilling rigs across 9 regions from Yamal to Irkutsk. Each rig carries 380+ sensors (pressure, vibration, bearing temperature, drilling-fluid flow, drill bit torque). Average daily output across the fleet: 360,000 barrels. One hour of downtime per rig costs the company 320,000 rubles: lost production + crew salaries + equipment lease + supply contract penalties. Through 2024, scheduled maintenance was "by calendar": every N hours, regardless of actual condition.
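As a rough consistency check, the hourly downtime cost above can be set against the headline figures (188,000 recovered rig-hours per year and 7.7M rubles of average daily revenue per rig). The sketch below only multiplies numbers quoted in this case study; the function names are illustrative, and the assumption that a recovered hour is worth 1/24 of a day's revenue is a simplification.

```python
# Figures quoted in this case study.
RECOVERED_RIG_HOURS_PER_YEAR = 188_000
DAILY_REVENUE_PER_RIG_RUB = 7_700_000
DOWNTIME_COST_PER_HOUR_RUB = 320_000


def value_from_daily_revenue(hours: float, daily_revenue_rub: float) -> float:
    """Value of recovered hours, assuming revenue accrues evenly over 24 h."""
    return hours / 24 * daily_revenue_rub


def value_from_hourly_cost(hours: float, hourly_cost_rub: float) -> float:
    """Value of recovered hours, priced at the all-in downtime cost per hour."""
    return hours * hourly_cost_rub


by_revenue = value_from_daily_revenue(
    RECOVERED_RIG_HOURS_PER_YEAR, DAILY_REVENUE_PER_RIG_RUB
)
by_cost = value_from_hourly_cost(
    RECOVERED_RIG_HOURS_PER_YEAR, DOWNTIME_COST_PER_HOUR_RUB
)

print(f"by daily revenue: {by_revenue / 1e9:.1f}B rubles")
print(f"by hourly cost:   {by_cost / 1e9:.1f}B rubles")
```

Both estimates come out at roughly 60 billion rubles per year, so the two ways of pricing an hour of downtime agree with each other.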

Problem

Calendar-based maintenance created two equally bad problems. The first was overmaintenance: 38% of scheduled interventions were premature. A crew would fly 240 km by helicopter to the rig, disassemble a healthy unit, and put the known-good bearings back. Cost of each such wasted trip: 1.8 million rubles.

The second was failures between scheduled stops. For example, a bearing scheduled to last 4,000 hours starts making noise at 3,100 h and fails catastrophically at 3,400 h. That is no longer a planned replacement but an emergency: helicopter, scramble, "hot-job" disassembly, a week of downtime. This happened roughly once per year for every 18 rigs.
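The per-rig failure rate above can be scaled to the whole fleet as a quick sanity check. This is a sketch using only figures from this case study; the constant and function names are illustrative.

```python
FLEET_SIZE = 1_240                 # rigs in the fleet
RIGS_PER_EMERGENCY_PER_YEAR = 18   # "one such case per year per 18 rigs"

# Bearing example from the text: rated life, first noise, failure (hours).
NOISE_ONSET_H = 3_100
FAILURE_H = 3_400


def expected_emergencies_per_year(fleet_size: int, rigs_per_emergency: int) -> float:
    """Fleet-wide emergency failures per year implied by the per-rig rate."""
    return fleet_size / rigs_per_emergency


emergencies = expected_emergencies_per_year(FLEET_SIZE, RIGS_PER_EMERGENCY_PER_YEAR)
warning_window_h = FAILURE_H - NOISE_ONSET_H

print(f"implied emergencies per year: {emergencies:.1f}")
print(f"hours between first noise and failure: {warning_window_h}")
```

The implied fleet-wide rate is just under 69 per year, in line with the 68 emergency failures per year reported before the project. The bearing in the example gave 300 hours of audible warning that a calendar schedule had no way to use.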

Solution

Gazprom Neft built a two-tier predictive maintenance system. Tier 1: edge ML on the rig itself. Microcontrollers on each installation (24 ARM Cortex M7 cores per rig) run compact models (LSTM + 1D-CNN, 12MB each) on streaming sensor data. The goal is to detect early signs of failure: vibration spectral analysis identifies bearing resonances 200-400 hours before catastrophic failure.
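To make the idea of vibration spectral analysis concrete, here is a minimal sketch that measures how much signal energy falls inside a frequency band and compares it with a healthy baseline. It is not the LSTM + 1D-CNN model described above, just a hand-written band-energy check. The sample rate, the 100-140 Hz "resonance band", the 4x threshold, and all function names are assumptions made for illustration.

```python
import cmath
import math
from typing import Sequence, Tuple


def band_energy(
    signal: Sequence[float], sample_rate_hz: float, band_hz: Tuple[float, float]
) -> float:
    """Normalized spectral energy of `signal` inside `band_hz` (naive DFT)."""
    n = len(signal)
    f_lo, f_hi = band_hz
    energy = 0.0
    for k in range(1, n // 2):  # positive-frequency bins, skipping DC
        freq_hz = k * sample_rate_hz / n
        if f_lo <= freq_hz <= f_hi:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)
            )
            energy += abs(coeff) ** 2
    return energy / n ** 2


def is_bearing_suspect(
    signal: Sequence[float],
    baseline_energy: float,
    sample_rate_hz: float,
    band_hz: Tuple[float, float],
    ratio_threshold: float = 4.0,  # hypothetical alarm ratio
) -> bool:
    """Flag the signal if in-band energy exceeds the baseline by the ratio."""
    return band_energy(signal, sample_rate_hz, band_hz) > ratio_threshold * baseline_energy


def synth_vibration(n: int, sample_rate_hz: float, components) -> list:
    """Sum of sines; `components` is a list of (frequency_hz, amplitude)."""
    return [
        sum(a * math.sin(2 * math.pi * f * t / sample_rate_hz) for f, a in components)
        for t in range(n)
    ]


# Synthetic data: shaft tone at 50 Hz, bearing resonance growing at 120 Hz.
SAMPLE_RATE_HZ = 1_000.0
RESONANCE_BAND_HZ = (100.0, 140.0)
healthy = synth_vibration(200, SAMPLE_RATE_HZ, [(50.0, 1.0), (120.0, 0.05)])
worn = synth_vibration(200, SAMPLE_RATE_HZ, [(50.0, 1.0), (120.0, 0.5)])

baseline = band_energy(healthy, SAMPLE_RATE_HZ, RESONANCE_BAND_HZ)
print(is_bearing_suspect(healthy, baseline, SAMPLE_RATE_HZ, RESONANCE_BAND_HZ))
print(is_bearing_suspect(worn, baseline, SAMPLE_RATE_HZ, RESONANCE_BAND_HZ))
```

A learned model replaces the fixed band and threshold with features it discovers from data, but the underlying signal is the same: energy appearing at frequencies where a healthy bearing is quiet.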

Tier 2: a cloud "digital twin" of each rig. The data stream from the 380 sensors is aggregated in real time and fed to a large model (a Temporal Convolutional Network in PyTorch, 230M parameters). The model compares the rig's current "behavior" with historical samples of successful and failed scenarios drawn from 7,000 rig-years of data. If rig X now "looks like" rig Y did 70 hours before it failed, an alert is generated with a specific prediction and a recommended intervention.
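The "looks like" comparison can be illustrated with a simple nearest-neighbor lookup. The sketch below is a stand-in for the Temporal Convolutional Network, not a description of it: it assumes each rig's recent behavior has already been reduced to a small feature vector, and the `HistoricalWindow` type, the cosine-similarity measure, and the 0.95 cutoff are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class HistoricalWindow:
    rig_id: str
    features: Tuple[float, ...]
    hours_to_failure: Optional[float]  # None means the run stayed healthy


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def make_alert(
    current: Sequence[float],
    history: Sequence[HistoricalWindow],
    min_similarity: float = 0.95,  # hypothetical cutoff
) -> Optional[str]:
    """Return an alert if the closest historical window preceded a failure."""
    match = max(history, key=lambda w: cosine_similarity(current, w.features))
    similarity = cosine_similarity(current, match.features)
    if match.hours_to_failure is None or similarity < min_similarity:
        return None
    return (
        f"Resembles rig {match.rig_id} (similarity {similarity:.2f}), "
        f"which failed {match.hours_to_failure:.0f} h later"
    )


# Made-up history: one healthy window and one recorded 70 h before a failure.
HISTORY = [
    HistoricalWindow("A-healthy", (1.0, 0.0, 0.0), None),
    HistoricalWindow("Y", (0.2, 0.9, 0.4), 70.0),
]

print(make_alert((0.21, 0.88, 0.41), HISTORY))
print(make_alert((1.0, 0.01, 0.0), HISTORY))
```

Returning the matched rig along with the prediction is what makes the alert explainable: the engineer can be shown the historical case the current rig is being compared to.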

Critical: explainability. Field engineers don't trust "magical" AI. The team made every warning arrive with three reference charts: "here's your current vibration profile, here's the profile a week before failure on a similar rig in 2022, here's a normal profile". The engineer sees and decides. Recommendation acceptance: 84%.

Result

Emergency "between-plan" failures dropped from 68/year to 9: the top downtime killers (bearings, pumps, top drive) are now predicted 200+ hours out. Parts spending decreased 12% thanks to less overmaintenance, although the team had initially feared the opposite, that predictive replacement would increase spend. In practice, parts are replaced surgically and less often.

The biggest challenge was shifting the engineering culture. Experienced rig managers who had been drilling since the 1990s initially rejected model recommendations. The "AI as intern" framing helped: a junior engineer on shift can cite "AI recommends" as a second opinion, which gives political cover when disagreeing with a senior colleague. After 14 months the culture flipped: a rig WITHOUT predictive maintenance now feels under-equipped.

Technology stack

Edge: 1D-CNN + LSTM (12MB)
ARM Cortex M7 (24 cores per rig)
Cloud: Temporal Convolutional Network (230M params)
PyTorch + TorchServe
Kafka (sensor stream)
TimescaleDB
Yandex Cloud (10K vCPU)

Timeline

Pilot on 8 rigs: 7 months. Rollout to 200 rigs: another 11 months. Full coverage of 1,240 rigs: 26 months. Models are retrained every month on new data.

Team

53 people: ML (14), edge embedded (9), data engineers (8), domain experts (8), MLOps (6), integration (5), product (3)

Lessons learned

Edge + cloud are two different models, not one. Edge catches early signals in milliseconds; cloud does long-range forecast.
Explainability matters more than an extra 5% of accuracy. Engineers must see why the AI reached its conclusion, or they won't accept it.
"AI as intern" is a psychological hack for adoption: a junior engineer uses AI as political cover.
Overmaintenance is also expensive: 38% premature interventions × 1.8M rubles = serious money.
Sensitivity vs lead time is a tradeoff. Too-long warning = false alarms; too-short = can't react in time.
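The last lesson, the tradeoff between sensitivity and lead time, can be shown with a small threshold sweep. This is a sketch on made-up data: it assumes the model emits a degradation score between 0 and 1, and the traces, thresholds, and function names are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def lead_time_h(
    failing_trace: Sequence[Tuple[float, float]], threshold: float
) -> Optional[float]:
    """Hours before failure at which the score first reaches the threshold.

    `failing_trace` holds (hours_before_failure, score) pairs in time order.
    """
    for hours_before_failure, score in failing_trace:
        if score >= threshold:
            return hours_before_failure
    return None


def false_alarm_rate(
    healthy_traces: Sequence[Sequence[float]], threshold: float
) -> float:
    """Share of healthy rigs whose score ever reaches the threshold."""
    alarms = sum(1 for scores in healthy_traces if max(scores) >= threshold)
    return alarms / len(healthy_traces)


# One rig that eventually fails, and three that stay healthy (made-up scores).
FAILING = [(400.0, 0.2), (300.0, 0.4), (200.0, 0.6), (100.0, 0.8), (0.0, 1.0)]
HEALTHY = [[0.1, 0.3, 0.2], [0.2, 0.45, 0.3], [0.1, 0.2, 0.15]]

for threshold in (0.4, 0.6, 0.8):
    print(
        f"threshold={threshold}: "
        f"lead time {lead_time_h(FAILING, threshold)} h, "
        f"false alarm rate {false_alarm_rate(HEALTHY, threshold):.2f}"
    )
```

On this toy data the lowest threshold buys 300 hours of warning but also flags one of the three healthy rigs, while the highest threshold raises no false alarms and leaves only 100 hours to react.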
