OpenAI Explains the Origin of "Goblins" in GPT-5: How a Personality Bug Made It Into the Model
OpenAI identified a strange verbal tic in GPT-5: the model increasingly inserted "goblins," "gremlins," and other creatures into responses. The source was…
AI-processed from OpenAI Blog; edited by Hamidun News
In a recent breakdown, OpenAI explained a peculiarity that users and employees had noticed across several generations of GPT-5: the model increasingly mentioned "goblins," "gremlins," and other creatures in metaphors and jokes. The company traced how this speech quirk appeared after GPT-5.1, intensified in GPT-5.4, and partly carried over into GPT-5.5, then showed exactly which training stage produced the effect.
How They Found the Anomaly
OpenAI first saw clear signals in November, after the GPT-5.1 launch. User complaints about the model's overly familiar tone and certain repeated words prompted the investigation. One safety researcher specifically asked for a check of mentions of "goblin" and "gremlin" because he had run into such phrasings several times himself. When the team pulled the statistics, it turned out that after the GPT-5.1 release the word "goblin" appeared in ChatGPT 175% more often, and "gremlin" 52% more often.
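A frequency check of this kind can be sketched in a few lines of Python. This is only an illustration, not OpenAI's tooling: the function names, the plural-matching rule, and the sample responses are all assumptions made for the example.

```python
import re

def rate_per_1000(responses, word):
    """Occurrences of `word` (whole word, optional plural "s",
    case-insensitive) per 1,000 responses."""
    pattern = re.compile(rf"\b{re.escape(word)}s?\b", re.IGNORECASE)
    hits = sum(len(pattern.findall(text)) for text in responses)
    return 1000 * hits / len(responses)

def relative_increase_pct(before, after, word):
    """Percentage change in the word's rate between two samples."""
    baseline = rate_per_1000(before, word)
    if baseline == 0:
        raise ValueError("word never appears in the baseline sample")
    return 100 * (rate_per_1000(after, word) - baseline) / baseline

# Hypothetical samples, invented for illustration only.
before = ["A goblin hides in the cache.", "Plain answer.",
          "Another plain answer.", "Nothing unusual here."]
after = ["Goblins love this bug.", "A goblin again.",
         "Plain answer.", "The gremlin strikes."]

print(relative_increase_pct(before, after, "goblin"))
```

Read this way, "175% more often" means the rate after the release was 2.75 times the earlier rate, and "52% more often" means 1.52 times.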
At first this didn't look like a serious malfunction: a single metaphor could seem harmless or even amusing. But in GPT-5.4 the spike became more noticeable, and during early testing of GPT-5.5 in Codex, many employees were already remarking on the model's strange affinity for "goblin" comparisons.
For OpenAI this was an unpleasant type of defect: not a benchmark drop or a red flag in metrics, but a small linguistic habit spreading between versions and gradually changing the style of responses.
Where the Goblins Came From
The key clue was found in the personalization feature. OpenAI noticed that the "goblin" vocabulary appeared disproportionately often among users who had selected the Nerdy personality mode. The mode itself accounted for only 2.5% of all ChatGPT responses, yet it produced 66.7% of all "goblin" mentions.
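The size of that imbalance follows directly from the two percentages. A minimal sketch of the arithmetic, where the function name is hypothetical and only the two input figures come from the article:

```python
def overrepresentation(share_of_mentions_pct, share_of_responses_pct):
    """How many times more often a group produces the word than its
    share of overall responses would predict."""
    return share_of_mentions_pct / share_of_responses_pct

# 66.7% of "goblin" mentions came from a mode behind 2.5% of responses.
lift = overrepresentation(66.7, 2.5)
print(round(lift, 1))  # 26.7
```

In other words, Nerdy responses carried roughly 27 times more "goblin" mentions than their share of responses alone would suggest.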
In the system instruction for this personality, the model was asked to be playful, wise, and a bit quirky, and to undercut pathos with playful language. This immediately shifted the search for the cause from the realm of conjecture into the realm of a concrete training signal.
"The world is complex and strange, and this strangeness must be acknowledged, analyzed, and even enjoyed."
Next, OpenAI compared responses generated during RL training, with and without mentions of "goblin" or "gremlin." One reward signal stood out immediately: the one meant to reinforce the Nerdy style systematically rated "creatures" higher. An internal audit showed a positive shift in favor of such formulations in 76.2% of datasets. This explained why the quirk intensified within Nerdy, but it didn't explain why it started appearing outside this mode as well.
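The comparison described above can be illustrated with a small sketch. It assumes each dataset is a list of (response text, reward) pairs and measures the "shift" as a difference in mean reward. Both choices are assumptions for this example, as are all names and numbers. OpenAI has not published the audit code.

```python
from statistics import mean

CREATURE_WORDS = ("goblin", "gremlin")

def mentions_creature(text):
    """True if the response contains any tracked creature word."""
    lowered = text.lower()
    return any(word in lowered for word in CREATURE_WORDS)

def reward_shift(samples):
    """Mean reward of responses with a creature word minus mean reward
    of those without. Returns None if either group is empty."""
    with_word = [r for text, r in samples if mentions_creature(text)]
    without = [r for text, r in samples if not mentions_creature(text)]
    if not with_word or not without:
        return None
    return mean(with_word) - mean(without)

def share_with_positive_shift(datasets):
    """Fraction of comparable datasets where creature words score higher."""
    shifts = [reward_shift(samples) for samples in datasets.values()]
    shifts = [s for s in shifts if s is not None]
    return sum(s > 0 for s in shifts) / len(shifts)

# Hypothetical rollouts, invented for illustration only.
datasets = {
    "a": [("goblin mode engaged", 0.9), ("plain answer", 0.5)],
    "b": [("a gremlin in the loop", 0.4), ("plain answer", 0.6)],
    "c": [("goblins everywhere", 0.8), ("plain answer", 0.7)],
    "d": [("plain answer", 0.5)],  # no creature words, so it is skipped
}
print(share_with_positive_shift(datasets))
```

A real audit would also need to control for prompt and topic, since a raw difference in means can reflect what the responses were about rather than the wording itself.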
Here behavior transfer came into play. According to OpenAI's data, when mentions of "goblin" and "gremlin" rose within Nerdy, they rose in nearly the same relative proportion in samples without this prompt as well. In other words, a locally rewarded style began seeping into the model's more general style.
This is an important point: the habit was being reinforced not as a feature of one personality, but as an acceptable general response technique.
The company describes the mechanism as follows:
- playful response style is rewarded
- some successful examples contain the characteristic verbal tic
- the tic begins appearing more frequently in new rollout responses
- these responses enter supervised fine-tuning and preference data
- the model reproduces the same technique even more confidently
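The loop above can be mimicked with a deliberately simple toy model. It is not a description of OpenAI's training pipeline: the update rule, the `bonus` and `learning_rate` parameters, and the starting rate are all assumptions chosen to show the compounding effect.

```python
def simulate_tic_rate(p0, bonus, rounds, learning_rate=0.5):
    """Toy feedback loop for a verbal tic.

    p0:            initial share of responses containing the tic
    bonus:         relative extra selection weight the reward gives
                   to responses with the tic (0 means no preference)
    learning_rate: how far the model moves toward the selected data
                   in each training round
    """
    rates = [p0]
    p = p0
    for _ in range(rounds):
        # Share of tic responses among examples kept for training.
        weighted_tic = p * (1 + bonus)
        selected = weighted_tic / (weighted_tic + (1 - p))
        # The model shifts partway toward the data it was trained on.
        p = p + learning_rate * (selected - p)
        rates.append(p)
    return rates

# Hypothetical numbers: a 1% starting rate and a 50% selection bonus.
for round_number, rate in enumerate(simulate_tic_rate(0.01, 0.5, 5)):
    print(round_number, f"{rate:.4f}")
```

Even a modest bonus raises the rate every round, because each generation's output becomes part of the next generation's training data. With `bonus=0` the rate stays where it started.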
An additional SFT data check for GPT-5.5 showed that the issue wasn't limited to goblins. The training examples contained other "signal" creatures as well: raccoons, trolls, ogres, and pigeons. Meanwhile, the word "frog" turned out to be normal and contextually appropriate in most cases, meaning the problem wasn't animals or fairy-tale imagery as such, but a specific entrenched speech pattern.
In other words, the anomaly's vocabulary turned out to be broader than the initial complaints suggested.
How OpenAI Is Fixing It
After launching GPT-5.4, the company removed the Nerdy personality mode in March and simultaneously began fixing the training loop itself. The reward signal that especially favored "goblin" metaphors was removed from training, and training data containing such creature words began to be filtered so that the style would not be overemphasized or appear in inappropriate contexts.
This wasn't a cosmetic fix on the surface but an attempt to remove the source of the anomaly in the training logic itself before the effect became even more entrenched.
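A data filter of the kind described can be sketched as follows. The word list is taken from the creatures named in this article. Everything else (the regular expression, the function name, the sample texts) is an assumption for illustration, and a production filter would presumably judge context rather than match keywords.

```python
import re

# Creature words named in the article. "frog" is deliberately left out
# because its uses were mostly found to be appropriate.
FLAGGED_WORDS = ("goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon")
FLAGGED_PATTERN = re.compile(
    r"\b(?:" + "|".join(FLAGGED_WORDS) + r")s?\b", re.IGNORECASE
)

def split_examples(examples):
    """Separate training texts into (kept, flagged_for_review)."""
    kept, flagged = [], []
    for text in examples:
        if FLAGGED_PATTERN.search(text):
            flagged.append(text)
        else:
            kept.append(text)
    return kept, flagged

# Hypothetical training texts, invented for illustration only.
kept, flagged = split_examples([
    "The cache gremlin ate your config again.",
    "A frog's skin absorbs water directly.",
    "Move the logic into the controller class.",
])
print(len(kept), len(flagged))  # 2 1
```

Flagging examples for review instead of deleting them outright leaves room to keep legitimate uses, such as a question that really is about pigeons.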
The company couldn't completely avoid the effect immediately: GPT-5.5 training had already begun before the team reached the root cause. That's why at the Codex testing stage, OpenAI added a separate developer instruction that suppresses such formulations. In other words, simply disabling Nerdy wasn't enough.
In effect, the company acknowledges that even a narrowly tuned reward can leak into the model's general style and survive several training iterations if the side effect isn't caught in time.
The case also prompted researchers to build new tools for behavior auditing.
What This Means
The story about "goblins" is important not because of the goblins themselves, but because it shows a weak point in modern models: a small stylistic incentive in one personality setting can imperceptibly change the speech of the entire system.
For developers, this is a good signal that model behavior needs to be audited not only by large metrics, but also by small linguistic habits that later become systemic. It's often these small details that are the first to reveal a hidden shift in training.