How TGS and AWS Reduced Seismic AI Model Training from Six Months to Five Days
AI-processed from AWS Machine Learning Blog; edited by Hamidun News
TGS and AWS have reported a rare result for large AI projects: the time needed to train a seismic foundation model fell from roughly six months to five days. The team also enlarged the context window, that is, the volume of three-dimensional geological data the model can analyze in a single pass. For companies working in subsurface exploration, this means faster iterations and a more complete picture of underground structures.
TGS provides geoscientific data to the energy sector. The company uses seismic foundation models to analyze complex 3D volumes and locate geological structures that matter for exploration and production. The base architecture is a Vision Transformer trained with a Masked AutoEncoder (MAE) scheme.
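To make the Masked AutoEncoder idea concrete, here is a minimal NumPy sketch of the two steps that typically precede the encoder: cutting a 3D volume into cubic patches (tokens) and hiding a random subset of them. The 16-voxel patch edge, the 75% mask ratio, and the function names `patchify_3d` and `random_mask` are assumptions chosen for illustration; the article does not state the values TGS uses.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches, one row per token."""
    d, h, w = volume.shape
    assert d % patch == 0 and h % patch == 0 and w % patch == 0
    x = volume.reshape(d // patch, patch, h // patch, patch, w // patch, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)   # patch-grid axes first, in-patch axes last
    return x.reshape(-1, patch ** 3)    # (num_tokens, voxels_per_patch)

def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted index arrays of the visible and the masked tokens."""
    num_masked = int(num_tokens * mask_ratio)
    perm = rng.permutation(num_tokens)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

rng = np.random.default_rng(0)
volume = rng.standard_normal((32, 32, 64)).astype(np.float32)  # toy stand-in for seismic data

tokens = patchify_3d(volume, patch=16)                 # assumed patch edge of 16 voxels
visible_idx, masked_idx = random_mask(len(tokens), 0.75, rng)  # assumed mask ratio

encoder_input = tokens[visible_idx]          # the encoder only sees visible patches
reconstruction_target = tokens[masked_idx]   # the decoder is trained to rebuild these
print(tokens.shape, encoder_input.shape, reconstruction_target.shape)
```

Because the encoder processes only the visible patches, masking also reduces the compute per training step, which is one reason the scheme is popular for large volumetric inputs.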
The main challenge is not only the size of the model but also the nature of the data: seismic volumes contain billions of points, are stored in specialized formats, and must be streamed to the GPUs continuously so that they never sit idle. Together with the AWS Generative AI Innovation Center, the company moved training to Amazon SageMaker HyperPod and assembled a cluster of 16 Amazon EC2 P5 nodes. Each node had 8 NVIDIA H200 GPUs with 141 GB of HBM3e memory each, 192 vCPUs, 2 TB of RAM, and EFAv3 networking at 3,200 Gbps.
That makes 128 GPUs in total. According to AWS, the configuration scaled almost linearly: parallel efficiency stayed at roughly 90–95% when going from one node to 16.
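As a rough illustration of what the 90–95% figure implies, scaling efficiency is usually defined as measured cluster throughput divided by ideal (perfectly linear) throughput. The sketch below applies that common definition to the node and GPU counts from the article; the helper names are hypothetical, and the article does not say exactly how AWS measured efficiency.

```python
GPUS_PER_NODE = 8   # from the article
NODES = 16          # from the article

def scaling_efficiency(throughput_n, throughput_1, n_nodes):
    """Measured throughput on n nodes relative to perfectly linear scaling."""
    return throughput_n / (throughput_1 * n_nodes)

def effective_speedup(efficiency, n_nodes):
    """Speedup over a single node implied by a given scaling efficiency."""
    return efficiency * n_nodes

total_gpus = GPUS_PER_NODE * NODES
print(f"total GPUs: {total_gpus}")
for eff in (0.90, 0.95):
    print(f"efficiency {eff:.0%} -> about {effective_speedup(eff, NODES):.1f}x one node")
```

Under this definition, 16 nodes at 90–95% efficiency deliver the work of roughly 14.4 to 15.2 perfectly utilized nodes.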
Rather than the classic setup built on Amazon FSx for Lustre, the team streamed its training data directly from Amazon S3. The reason is simple: with S3, every node added to the cluster brings its own bandwidth, whereas a shared file system becomes a bottleneck sooner. TGS stores its training data in MDIO, a format the company developed on top of Zarr and optimized for large scientific datasets in the cloud.
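One common way to keep GPUs fed when streaming chunked data from object storage is to overlap many reads using a thread pool and a bounded look-ahead window. The sketch below shows that general pattern only; `fetch_chunk` is a hypothetical stand-in that simulates latency instead of calling S3, and `prefetching_loader`, `num_workers`, and `prefetch_depth` are illustrative names, not part of MDIO, Zarr, or the TGS pipeline.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(key):
    """Hypothetical stand-in for reading one chunk from object storage."""
    time.sleep(0.01)           # simulated network latency
    return f"data:{key}"

def prefetching_loader(keys, fetch, num_workers=8, prefetch_depth=16):
    """Yield fetch(key) in order while keeping up to prefetch_depth reads in flight."""
    keys = iter(keys)
    done = object()            # sentinel marking the end of the key stream
    pending = deque()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Prime the pipeline with the first prefetch_depth requests.
        for key in keys:
            pending.append(pool.submit(fetch, key))
            if len(pending) >= prefetch_depth:
                break
        while pending:
            chunk = pending.popleft().result()   # blocks only if this read is unfinished
            nxt = next(keys, done)
            if nxt is not done:
                pending.append(pool.submit(fetch, nxt))  # refill the window
            yield chunk

keys = [f"chunk-{i}" for i in range(32)]
start = time.perf_counter()
chunks = list(prefetching_loader(keys, fetch_chunk))
elapsed = time.perf_counter() - start
print(f"loaded {len(chunks)} chunks in {elapsed:.2f}s")
```

Threads suit this workload because the time is spent waiting on the network rather than on computation; the bounded window keeps memory use predictable while hiding per-request latency.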
After multi-threaded loading and prefetching were tuned, each node sustained 4–5 GB/s, or 64–80 GB/s across the cluster. The approach also cut storage infrastructure costs by more than 90%. For distributed training, the team compared several approaches: DeepSpeed ZeRO-2, ZeRO-3, and FSDP2.
ZeRO-2 offered the best balance between speed and memory savings: 1,974 samples per second, compared with 1,833 for FSDP2 and 869 for ZeRO-3. This matters because maximum memory savings do not always produce the best overall result: when communication between GPUs becomes too expensive, throughput drops sharply. The team therefore optimized not for the most memory-efficient option in theory, but for the configuration that actually trains fastest in production.
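The memory side of this trade-off can be sketched with the standard accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states, with each ZeRO stage sharding one more of these across GPUs. The 1-billion-parameter model below is a hypothetical size, since the article does not give TGS's parameter count, and activations are ignored. The throughput numbers are the ones quoted above.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Per-GPU bytes for weights, gradients and Adam states under ZeRO stage 0-3."""
    params = 2 * num_params        # fp16 weights
    grads = 2 * num_params         # fp16 gradients
    optim = 12 * num_params        # fp32 master weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus          # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus          # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus         # stage 3 also shards the weights
    return params + grads + optim

NUM_PARAMS = 1_000_000_000         # hypothetical model size, not from the article
NUM_GPUS = 128                     # from the article

for stage in (0, 2, 3):
    gib = model_state_bytes_per_gpu(NUM_PARAMS, NUM_GPUS, stage) / 2**30
    print(f"ZeRO stage {stage}: {gib:.2f} GiB of model state per GPU")

# Throughput in samples per second, as reported in the article.
throughput = {"ZeRO-2": 1974, "FSDP2": 1833, "ZeRO-3": 869}
for name, value in throughput.items():
    print(f"ZeRO-2 is {throughput['ZeRO-2'] / value:.2f}x the throughput of {name}")
```

ZeRO-3 saves the most memory because it shards the weights too, but that means weights must be gathered over the network during the forward and backward passes. When ZeRO-2 already fits in GPU memory, that extra communication buys nothing, which is consistent with the throughput gap reported here.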
The second key achievement is the larger context window. For seismic models, this directly affects analysis quality: the larger the volume of rock the model sees in a single pass, the better it captures both local details, such as small fractures, and large-scale patterns, such as fault systems spanning an entire basin. Using context parallelism and a ring attention implementation adapted to the 3D Vision Transformer, the team increased the maximum input size from 640 x 640 x 1,024 to 1,536 x 1,536 x 2,048 voxels.
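The core idea of ring attention can be shown in a single-process NumPy simulation: the token sequence is split across devices, each device keeps its own queries, and key/value blocks travel around a ring while every device accumulates its output with a running (online) softmax. This is a sketch of the general technique under simplifying assumptions (one attention head, no masking, list indexing in place of real GPU-to-GPU transfers); it is not the TGS or AWS implementation.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference softmax attention over the whole sequence at once."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q, k, v, num_devices):
    """Simulate ring attention: each 'device' owns one sequence shard of Q, K and V."""
    q_shards = np.array_split(q, num_devices)
    kv_shards = list(zip(np.array_split(k, num_devices),
                         np.array_split(v, num_devices)))
    scale = 1.0 / np.sqrt(q.shape[-1])
    # Running softmax statistics per device: row max, normalizer, unnormalized output.
    run_max = [np.full((len(s), 1), -np.inf) for s in q_shards]
    run_sum = [np.zeros((len(s), 1)) for s in q_shards]
    run_out = [np.zeros((len(s), v.shape[-1])) for s in q_shards]
    for step in range(num_devices):
        for dev in range(num_devices):
            # K/V block held by this device at this step of the ring rotation.
            k_blk, v_blk = kv_shards[(dev + step) % num_devices]
            scores = q_shards[dev] @ k_blk.T * scale
            new_max = np.maximum(run_max[dev], scores.max(axis=-1, keepdims=True))
            correction = np.exp(run_max[dev] - new_max)   # rescale earlier partial results
            probs = np.exp(scores - new_max)
            run_sum[dev] = run_sum[dev] * correction + probs.sum(axis=-1, keepdims=True)
            run_out[dev] = run_out[dev] * correction + probs @ v_blk
            run_max[dev] = new_max
    return np.concatenate([o / s for o, s in zip(run_out, run_sum)])

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
ring_out = ring_attention(q, k, v, num_devices=4)
print(np.allclose(ring_out, full_attention(q, k, v)))
```

No device ever materializes the full attention matrix: each holds only a block of scores at a time, so per-device memory grows with the shard length rather than the full sequence length. In a real multi-GPU setup, the transfer of the next key/value block is typically overlapped with the computation on the current one.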
Context length grew from 102,400 to 1.17 million tokens, and the analyzed volume is reported to have increased roughly 4.5-fold.
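The token counts can be reproduced from the voxel dimensions if each token is assumed to cover a 16 x 16 x 16 patch. That patch size is an inference from the quoted numbers, not something the article states. A short check:

```python
from math import prod

PATCH = 16  # assumed patch edge in voxels, inferred from the quoted token counts

def num_tokens(shape, patch=PATCH):
    """Number of non-overlapping cubic patches that tile a volume of the given shape."""
    return prod(dim // patch for dim in shape)

old_shape = (640, 640, 1024)      # from the article
new_shape = (1536, 1536, 2048)    # from the article

old_tokens = num_tokens(old_shape)
new_tokens = num_tokens(new_shape)
voxel_ratio = prod(new_shape) / prod(old_shape)

print(f"old context: {old_tokens:,} tokens")
print(f"new context: {new_tokens:,} tokens")
print(f"voxel count ratio: {voxel_ratio:.2f}")
```

Under this assumption the old input yields exactly 102,400 tokens and the new one 1,179,648, in line with the figures above. The raw voxel count grows by about 11.5 times, so the roughly 4.5-fold volume figure presumably refers to a different measure than voxels per pass.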
In practical terms, TGS can now update its models roughly weekly instead of once every six months, incorporate new geological data faster, and give clients a broader analysis context. For AWS, the case is another example showing that specialized foundation models in science and industry are constrained not only by model architecture but also by how well data, networking, and distributed training are organized. For the market as a whole, the signal is clear: narrowly specialized AI is starting to win in areas once dominated by long compute cycles and prohibitively expensive infrastructure.