NVIDIA Open-SWE-Traces: data preparation for fine-tuning coding agents

AI-processed from MarkTechPost; edited by Hamidun News

NVIDIA has released Open-SWE-Traces, a dataset of real multi-step sessions in which AI agents solve software development tasks. The accompanying tutorial walks through the complete pipeline, from streaming the raw data to producing a dataset ready for supervised fine-tuning (SFT).

What is Open-SWE-Traces

The dataset contains thousands of agent trajectories. Each one is a complete session record: the AI receives an engineering task, calls tools step by step (reading files, running tests, searching code), iterates on a solution, and outputs a final patch. This is fundamentally different from typical question-answer datasets, because it captures not only the result but also how the agent arrived at it.

Each record contains structured metadata:

  • trajectory length — number of agent steps
  • list of used tools and call frequency
  • size of final diff in lines of code
  • programming language of the task
  • flag for successful or unsuccessful solution

The data is hosted on Hugging Face and supports streaming, so you can work with it in Google Colab without downloading the whole dataset. That matters because the dataset runs to several gigabytes.
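A minimal sketch of the streaming approach, assuming the Hugging Face `datasets` library. The repository ID is not given in this article, so it is left as a parameter. The `"train"` split name and the helper names `open_stream` and `peek` are illustrative assumptions, not part of the tutorial.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def open_stream(repo_id: str, split: str = "train"):
    """Open a Hugging Face dataset lazily instead of downloading it in full.

    repo_id is supplied by the caller: the exact repository name is not
    stated in this article. The default split name is an assumption.
    """
    # Imported lazily so the rest of this file runs without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def peek(stream: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Pull only the first n records from any iterable of records."""
    return list(islice(stream, n))
```

Because `peek` accepts any iterable, it works the same way on a streamed dataset and on a small in-memory list used for experiments.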

How the Pipeline is Built

The tutorial walks through several processing stages. The first is dialog normalization. Multi-step agent sessions are converted to a unified format in which user messages, agent responses, and tool calls are arranged into a single ordered sequence. This is necessary because different agent versions log sessions differently.
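As an illustration of the normalization step, here is a small sketch that maps two hypothetical log layouts (`role`/`content` and `from`/`value`) onto one schema. The key names and role aliases are assumptions made for this example. The real field names in Open-SWE-Traces may differ.

```python
from typing import Any, Dict, List

# Hypothetical aliases: real logs may use other keys and role names.
ROLE_ALIASES = {
    "human": "user", "user": "user",
    "gpt": "assistant", "agent": "assistant", "assistant": "assistant",
    "tool": "tool", "function": "tool", "observation": "tool",
}


def normalize_turn(turn: Dict[str, Any]) -> Dict[str, str]:
    """Map one logged turn onto the unified {"role", "content"} shape."""
    raw_role = turn.get("role", turn.get("from", ""))
    role = ROLE_ALIASES.get(str(raw_role).lower())
    if role is None:
        # Fail loudly rather than guess at an unknown speaker.
        raise ValueError(f"unknown role: {raw_role!r}")
    content = turn.get("content", turn.get("value", ""))
    return {"role": role, "content": "" if content is None else str(content)}


def normalize_session(turns: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Normalize every turn while preserving the original order."""
    return [normalize_turn(t) for t in turns]
```

Raising on an unknown role is a deliberate choice in this sketch: silently dropping turns would break the ordering that the training example depends on.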

The second is patch parsing. The code changes themselves are extracted from the agent's final output in unified diff format. This patch becomes the 'answer' in the training example.
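A rough sketch of patch parsing, assuming the final message embeds a unified diff either as plain text or inside a Markdown fence. `extract_patch` and `count_changed_lines` are hypothetical helpers written for this example, not the tutorial's code.

```python
from typing import Optional, Tuple


def extract_patch(final_output: str) -> Optional[str]:
    """Return the unified diff found in the agent's final message, or None."""
    lines = final_output.splitlines()
    start = None
    for i, line in enumerate(lines):
        is_git_header = line.startswith("diff --git ")
        is_plain_header = (
            line.startswith("--- ")
            and i + 1 < len(lines)
            and lines[i + 1].startswith("+++ ")
        )
        if is_git_header or is_plain_header:
            start = i
            break
    if start is None:
        return None
    patch_lines = []
    for line in lines[start:]:
        if line.startswith("```"):  # a closing Markdown fence ends the patch
            break
        patch_lines.append(line)
    return "\n".join(patch_lines) + "\n"


def count_changed_lines(patch: str) -> Tuple[int, int]:
    """Count (added, removed) lines, skipping the ---/+++ file headers.

    Simplification: a changed line whose own content begins with "--" or
    "++" looks like a header here and is not counted.
    """
    added = removed = 0
    for line in patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return added, removed
```

One limitation to keep in mind: if the agent writes prose after an unfenced diff, this sketch keeps that prose as part of the patch, so a stricter parser would be needed for noisy outputs.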

The third is assembling an analytical DataFrame. For each trajectory, key metrics are calculated: token budgets at different stages of the agent's work, the distribution of calls across tools, and success rates by language and task type.
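The kind of aggregation described here can be sketched with the standard library alone, so that it runs without extra dependencies. This is a stand-in for the DataFrame, not the tutorial's implementation, and the field names `language`, `success`, and `tool_calls` are assumptions.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable


def summarize(trajectories: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate tool usage and per-language success rates."""
    tool_counts: Counter = Counter()
    per_lang = defaultdict(lambda: [0, 0])  # language -> [successes, total]
    for t in trajectories:
        tool_counts.update(t["tool_calls"])  # one entry per tool invocation
        stats = per_lang[t["language"]]
        stats[0] += int(bool(t["success"]))
        stats[1] += 1
    success_rate = {lang: s / n for lang, (s, n) in per_lang.items()}
    return {"tool_counts": dict(tool_counts), "success_rate": success_rate}
```

With pandas installed, `pd.DataFrame(rows).groupby("language")["success"].mean()` would give the same per-language rates from the same rows.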

Filtering for SFT

The final step is selecting examples for training. The authors apply a chain of filters.

By success labels: only trajectories with successful solutions enter the sample. Training on failed sessions without special markup is risky, because the model can pick up incorrect patterns.

By tokens: trajectories longer than the specified limit are filtered out. Examples that are too long do not fit in the context window under standard training settings.

By language: if you need a specialized agent for Python or JavaScript, filtering retains only the relevant examples.

By patch presence: sessions without final code are useless for the SFT task, where the model must learn to output a specific result.

"The quality of training data is more important than quantity, especially for agent traces, where failed sessions can cement bad patterns in the model."
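The filter chain described in this section can be sketched as a single predicate applied lazily, which also suits streamed data. The field names (`success`, `patch`, `language`, `text`) and the whitespace-based token count are assumptions for illustration. A real pipeline would count tokens with the tokenizer of the model being trained.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Set

Record = Dict[str, Any]


def approx_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated pieces."""
    return len(text.split())


def keep_for_sft(rec: Record, max_tokens: int,
                 languages: Optional[Set[str]] = None) -> bool:
    """Apply the four filters: success, patch, language, token limit."""
    if not rec.get("success"):
        return False  # failed sessions are excluded
    if not rec.get("patch"):
        return False  # no final code means nothing to learn as the answer
    if languages is not None and rec.get("language") not in languages:
        return False  # outside the target languages
    return approx_tokens(rec.get("text", "")) <= max_tokens


def filter_for_sft(records: Iterable[Record], max_tokens: int,
                   languages: Optional[Set[str]] = None) -> Iterator[Record]:
    """Yield matching records one at a time, without loading everything."""
    return (r for r in records if keep_for_sft(r, max_tokens, languages))
```

The cheap checks (success flag, patch presence, language) come first so that the comparatively expensive token count runs only on records that could still be kept.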

What This Means

NVIDIA's Open-SWE-Traces is one of the first public datasets with real agent trajectories for engineering tasks. The tutorial provides a working template that goes from raw data on Hugging Face to an SFT-ready dataset with relatively little code. For teams building their own coding agents, it is a starting point that avoids gathering data from scratch.
