OpenAI sums up Parameter Golf: how coding agents are changing machine learning research
OpenAI has published a recap of Parameter Golf, an open ML challenge with a 16 MB limit per artifact and ten minutes of training on 8xH100. More than 1,000 people took part.

OpenAI has published a recap of Parameter Golf, an open machine learning competition where participants had to find unconventional solutions under very tight constraints. Over eight weeks, the challenge gathered more than 1,000 participants and over 2,000 submissions, and the main surprise was how dramatically AI agents changed the research process itself.
How the challenge was organized
The idea behind Parameter Golf was simple only on paper: minimize held-out loss on a fixed FineWeb dataset while keeping the entire artifact, model weights and training code included, within a 16 MB limit. There was one more constraint: training could take no more than ten minutes on eight H100 accelerators.
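To make the size constraint concrete, here is a minimal sketch of the kind of check a submission has to pass. The directory layout and names are hypothetical, chosen for illustration; the article does not describe the actual competition harness.

```python
import os

# Hypothetical layout: weights + training code live in one submission folder.
ARTIFACT_DIR = "submission/"
SIZE_LIMIT_BYTES = 16 * 1024 * 1024  # the 16 MB artifact budget

def artifact_size_bytes(path: str) -> int:
    """Sum the size of every file in the submission directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

size = artifact_size_bytes(ARTIFACT_DIR)
print(f"artifact: {size / 2**20:.2f} MiB, within limit: {size <= SIZE_LIMIT_BYTES}")
```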
OpenAI deliberately chose this configuration to keep the task verifiable while preventing it from turning into a simple brute-force search. The organizers provided a baseline, the dataset, and evaluation scripts, and submissions were accepted via GitHub. This format kept the competition open not only to researchers from major labs but also to independent developers who could experiment quickly and carefully stack improvements on top of others' ideas.
OpenAI also notes that the format turned out to be a good tool for finding strong engineers: it reveals not only theoretical knowledge but also research taste, persistence, and discipline.
What participants found
The strongest results came not from one magical idea but from many precise technical decisions. Some participants squeezed extra quality out of already known components by tuning the optimizer, initialization, and learning-rate schedule. Others focused on compression to fit the model within the strict size limit. There were also entries on the edge of what was allowed, where improving the model almost blurred into gaming the evaluation, so the organizers had to check separately whether such techniques violated the spirit of the rules.
- Training tweaks: participants combined already-known improvements and achieved even lower error without changing the core recipe.
- Quantization: GPTQ-lite and full-Hessian GPTQ entered the competition as ways to compress weights more aggressively after training (a simplified sketch of the underlying idea follows after this list).
- Adaptation during evaluation: some entries used test-time LoRA and similar approaches while staying within the formal rules.
- New data representations: non-standard tokenizers appeared, along with lossless ways to encode letter case and the byte structure of text.
- Architectural moves: participants tried partial attention variants, hash features for neighboring tokens, and even layer reuse as a recurrent mechanism.
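The sketch below shows only the basic storage idea behind post-training weight quantization: small integers plus a per-channel scale instead of full-precision floats. It is deliberately the simplest round-to-nearest variant, not GPTQ itself, which additionally uses (approximate) Hessian information to choose a rounding that minimizes layer output error.

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, bits: int = 4):
    """Round-to-nearest symmetric quantization with one scale per output row.

    The 4-bit values are stored in int8 here purely for simplicity; a real
    submission would pack them tighter to actually save space.
    """
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy weight matrix, just to see the reconstruction error of naive rounding.
w = np.random.randn(256, 512).astype(np.float32)
q, s = quantize_per_channel(w, bits=4)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```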
OpenAI separately highlighted the non-record track, a more experimental division where absolute ranking mattered less than technical boldness. It featured ideas such as state-space models combined with JEPA, Guided Attention, a byte-level H-Net, non-autoregressive text modeling, and dynamic tokenization. The track was not just decorative, either: half the entries beat the naive baseline of 1.22 BPB, and the best result reached 1.12 BPB. This is an important signal that alternative approaches can still compete even against strong transformer baselines.
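For readers unfamiliar with the metric: BPB (bits per byte) is a tokenizer-independent way to report language-modeling loss. Assuming the usual definition, it is the summed cross-entropy in nats converted to bits and divided by the number of UTF-8 bytes of the evaluated text; the numbers below are made up for illustration, not competition results.

```python
import math

def bits_per_byte(total_loss_nats: float, n_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a text into bits per byte."""
    return total_loss_nats / math.log(2) / n_bytes

# Toy example with hypothetical numbers:
text = "example held-out passage"
n_bytes = len(text.encode("utf-8"))
total_loss_nats = 19.4  # pretend summed token losses over this text
print(f"{bits_per_byte(total_loss_nats, n_bytes):.2f} BPB")
```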
How AI agents influenced things
The main difference between Parameter Golf and similar competitions of previous years is the mass use of coding agents. According to OpenAI, the overwhelming majority of participants mentioned working with agents. This dramatically lowered the barrier to entry: it became easier to set up an environment, figure out unfamiliar code, quickly test a hypothesis, and assemble a working submission without the lengthy manual routine.
Additional help came from infrastructure: RunPod's sponsorship program gave participants $1 million in compute, so more people could experiment. But along with speed came noise: many new submissions were not independent breakthroughs but small variations on top of already successful solutions.
In itself this is not a problem: good ideas should spread quickly. The real issue is different: once a strong but invalid technique caught attention, other agents would start copying and scaling it, pushing work further along the wrong trajectory. Because of this, verification, attribution of contributions, and correct scoring became noticeably harder than in competitions from the pre-agent era.
The flood of submissions also changed the operational side of the competition. On days when hundreds of entries arrived, manual analysis stopped working, so OpenAI built an internal triage bot based on Codex that tracked new submissions and flagged those needing manual review.
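OpenAI's actual bot is not public, but the general polling-and-flagging loop it describes can be sketched in a few lines. The repository name and the flagging heuristics below are placeholders, and the sketch only uses the public GitHub REST API; any Codex-based analysis would sit where the crude keyword check is.

```python
import requests

# Hypothetical repository; substitute the real submissions repo.
REPO = "example-org/parameter-golf-submissions"
API = f"https://api.github.com/repos/{REPO}/pulls?state=open&per_page=50"
SUSPICIOUS_WORDS = ("eval", "test-time", "leaderboard")  # crude placeholder heuristics

def triage_open_submissions():
    """Return (number, title) of open PRs whose titles trip a heuristic."""
    flagged = []
    for pr in requests.get(API, timeout=30).json():
        title = pr["title"].lower()
        if any(word in title for word in SUSPICIOUS_WORDS):
            flagged.append((pr["number"], pr["title"]))
    return flagged

for number, title in triage_open_submissions():
    print(f"needs manual review: #{number} {title}")
```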
AI agents also became part of the community: one participant, together with their agent, ran live leaderboard update bulletins, and tools appeared around the competition that helped newcomers check their ideas for compliance with the rules.
What this means
Parameter Golf showed that AI-assisted research is already everyday practice, not just an attractive hypothesis. Agents accelerate entry into ML, make experiments cheaper, and widen the circle of participants, but they also change the very mechanics of scientific competitions. If such formats are repeated, organizers will need to design not only the task but also a system for filtering, review, and fair attribution of contributions in a world where code is increasingly no longer written by a single person.