
DSGym: A Framework for Training Data Science Agents with 90+ Scientific Tasks


AI-processed from Together AI Blog; edited by Hamidun News
Source: Together AI Blog. Collage: Hamidun News.

Together AI has released DSGym, a unified framework for evaluating and training LLM agents that solve data science tasks. Existing benchmarks rely on incompatible interfaces, and many of their tasks can be solved without any actual data analysis. DSGym addresses both problems: it brings 90 new bioinformatics tasks and 92 Kaggle competitions, alongside established benchmarks, into a single ecosystem, and it adds a synthetic data generation pipeline for training.

Why Existing Benchmarks Don't Work

The current approach to evaluating data science agents suffers from fragmentation. Different benchmarks use incompatible APIs, data formats, and evaluation metrics, which makes fair comparison difficult and prevents them from being combined into a single system. Supporting each benchmark means writing a separate integration from scratch, which is costly. Moreover, many tasks in existing benchmarks can be solved without touching the data: an agent can guess the result, find the answer online, or apply a template solution that requires no understanding of the actual problem.

How DSGym Works

DSGym addresses the fragmentation through a unified JSON interface. Each task is described by four components: dataset, query text, evaluation metric, and metadata. New tasks, tools, and agent strategies can therefore be added without redesigning the framework. Agent code runs in containers that are provisioned on demand with pre-installed dependencies. This architecture is designed to provide security (an isolated environment), reproducibility (every run starts from the same state), and fair evaluation (every agent runs in the same standardized environment rather than an ad hoc development setup).
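To make the four-component task description concrete, here is a minimal Python sketch. The article does not give DSGym's actual schema, so the field names (`dataset`, `query`, `metric`, `metadata`), the `TaskSpec` class, the `load_task` helper, and the `exact_match` metric are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class TaskSpec:
    """Hypothetical task record holding the four components named in the article."""
    dataset: str                      # assumed: path or identifier of the task data
    query: str                        # natural-language question posed to the agent
    metric: str                       # name of the evaluation metric to apply
    metadata: Dict[str, Any] = field(default_factory=dict)


REQUIRED_FIELDS = ("dataset", "query", "metric", "metadata")


def load_task(raw: str) -> TaskSpec:
    """Parse a JSON string into a TaskSpec, rejecting incomplete records."""
    record = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"task is missing fields: {missing}")
    return TaskSpec(**{name: record[name] for name in REQUIRED_FIELDS})


def exact_match(prediction: str, reference: str) -> float:
    """Toy metric: 1.0 if the normalized strings are equal, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


# Registry mapping metric names to scoring functions (illustrative only).
METRICS: Dict[str, Callable[[str, str], float]] = {"exact_match": exact_match}

example_json = json.dumps({
    "dataset": "data/expression_counts.csv",   # made-up path
    "query": "Which gene has the highest mean expression?",
    "metric": "exact_match",
    "metadata": {"domain": "bioinformatics", "reference": "TP53"},
})

task = load_task(example_json)
score = METRICS[task.metric](" tp53 ", task.metadata["reference"])
print(task.query, score)
```

Because every task shares one record shape, adding a new benchmark in this sketch amounts to emitting more JSON records and, if needed, registering one more metric function, rather than writing a new harness.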

What Tasks Are Included in DSGym

The framework's tasks fall into two main categories:

  • Data Analysis: finding answers to questions through programmatic analysis of structured data
  • Data Prediction: developing end-to-end ML pipelines for forecasting and classification

These categories are covered by the following task sets:

  • DSBio: 90 bioinformatics tasks extracted from published scientific papers
  • DSPredict: 92 Kaggle competitions, including time series, computer vision, and molecular modeling
  • MLEBench and QRData: established benchmarks from prior work, integrated into the same interface

Synthetic data for training is generated through a specialized pipeline. The system executes queries, records complete solution trajectories, and creates examples in the form (task, code, result). From 3,700 automatically generated examples, the authors selected 2,000 high-quality ones through LLM filtering.
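The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `Trajectory` class, the `filter_trajectories` function, and the threshold are hypothetical, and `toy_judge` is a rule-based stand-in for the LLM judge, whose prompts and criteria the article does not describe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One recorded example in the (task, code, result) form from the article."""
    task: str
    code: str
    result: str


def filter_trajectories(
    candidates: List[Trajectory],
    judge: Callable[[Trajectory], float],
    threshold: float,
) -> List[Trajectory]:
    """Keep only trajectories whose judge score reaches the threshold."""
    return [t for t in candidates if judge(t) >= threshold]


def toy_judge(t: Trajectory) -> float:
    """Placeholder for an LLM-based quality judge (assumption, not the real one).

    Awards 0.5 for a non-empty result and 0.5 if the code appears to use
    pandas, as a crude proxy for "the answer came from real data analysis".
    """
    score = 0.0
    if t.result.strip():
        score += 0.5
    if "import pandas" in t.code:
        score += 0.5
    return score


candidates = [
    Trajectory("Mean age?", "import pandas as pd\nprint(df['age'].mean())", "41.3"),
    Trajectory("Mean age?", "print(42)", "42"),            # guessed, no analysis
    Trajectory("Mean age?", "import pandas as pd", ""),    # no result produced
]

kept = filter_trajectories(candidates, toy_judge, threshold=1.0)
print(len(kept), "of", len(candidates), "kept")
```

For scale, keeping 2,000 of 3,700 generated examples corresponds to retaining about 54% of the candidates.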

Results: SOTA Among Open-Source Models

A 4-billion-parameter model trained on this synthetic data achieved state-of-the-art performance among open-source LLMs on data science tasks. The result suggests that high-quality synthetic data generated by the framework can be enough to train competitive agents without relying on proprietary datasets.

What This Means

DSGym moves data science agents closer to being a practical tool rather than only a research topic. A unified platform and a synthetic data generation mechanism lower the barrier to entry: the reported result was obtained with about 2,000 filtered training examples rather than millions. For startups, research labs, and internal teams, this makes it easier to prototype and improve automated data analysis systems quickly.

