Research on ChatGPT: Does a feminine grammatical form in a prompt affect task-solving quality?

A researcher tested whether grammatical gender form in a Russian prompt affects ChatGPT's coding quality. On the LiveCodeBench benchmark, GPT-5.4 mini…

AI-processed from Habr AI; edited by Hamidun News

A small but carefully run experiment found an unwelcome effect: with GPT-5.4 mini, "feminine" phrasing in a Russian-language prompt can slightly degrade the quality of solutions to programming tasks. When the user's framing differed by a single gender marker, the model failed more often with the variant "я хотела бы твоей помощи" [I would like your help (feminine)], whereas the neutral and "masculine" formulations yielded nearly identical results.

On simple tasks the difference nearly disappeared, but on hard ones it was statistically significant. The check was prompted by a casual observation from an ML research engineer, who noticed that the model's answers became less precise when a Russian dialogue contained feminine forms such as "я уже попробовала" [I already tried (feminine)] or "я хотела бы" [I would like (feminine)]. Rather than rely on intuition alone, she posed the question strictly: does gender-marked self-presentation in Russian change the quality of solutions to English-language coding tasks when everything else in the prompt and the response format stays the same?

For the test, the researcher chose LiveCodeBench, a popular benchmark with tasks from LeetCode, AtCoder, and Codeforces, where solutions can be verified objectively against ready-made test cases. The key design choice was to keep the differences between prompt variants minimal. In the neutral version, the model was simply asked to help solve a Python task.

In the "masculine" variant, one phrase changed to "я хотел бы твоей помощи" [I would like your help (masculine)], and in the "feminine" one to "я хотела бы твоей помощи" [I would like your help (feminine)]. A second pair of similar formulations was also checked. In total, 1055 tasks from the LiveCodeBench v6 release were used, with the strictest run parameters: one attempt per task, temperature 0, and pass@1 as the primary metric, that is, whether the model solves the task on the first try.
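The minimal-difference design can be illustrated with a short sketch. Only the two gendered phrases below come from the article; the neutral wording, the template, the sample task, and the helper names `build_prompt` and `differing_tokens` are assumptions made for illustration, not the researcher's actual prompts.

```python
# Hypothetical prompt frames. Only the phrases "я хотел бы твоей помощи" and
# "я хотела бы твоей помощи" are taken from the article; the neutral wording
# and the surrounding template are assumptions.
FRAMES = {
    "neutral": "Нужна помощь с задачей на Python.",
    "masculine": "Я хотел бы твоей помощи с задачей на Python.",
    "feminine": "Я хотела бы твоей помощи с задачей на Python.",
}


def build_prompt(frame: str, task_statement: str) -> str:
    """Prepend a Russian framing sentence to an unchanged English task."""
    return f"{FRAMES[frame]}\n\n{task_statement}"


def differing_tokens(a: str, b: str) -> list[tuple[str, str]]:
    """Return the whitespace-separated token pairs at which two prompts differ."""
    return [(x, y) for x, y in zip(a.split(), b.split()) if x != y]


# A made-up task statement, standing in for a LiveCodeBench problem.
task = "Given an integer array nums, return the length of the longest increasing subsequence."
masc = build_prompt("masculine", task)
fem = build_prompt("feminine", task)

# The two gendered prompts should differ in exactly one word.
print(differing_tokens(masc, fem))
```

Checking the token-level difference before running the benchmark is a cheap way to confirm that nothing except the gender marker varies between conditions.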

Two OpenAI models were tested: GPT-5.4 mini and GPT-5.4.

To assess the robustness of the result, the scores were bootstrapped with 10,000 resamples to obtain 95 percent confidence intervals.
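A minimal sketch of how such an interval could be computed is shown below. The article does not say which bootstrap variant was used, so the paired resampling over tasks and the percentile interval are assumptions, as are the function names and the toy data.

```python
import random


def pass_at_1(results: list[bool]) -> float:
    """Fraction of tasks solved on the single allowed attempt."""
    return sum(results) / len(results)


def paired_bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for pass@1(a) - pass@1(b).

    a[i] and b[i] must be the outcomes of two prompt variants on the same
    task; tasks are resampled with replacement (assumed design, not
    confirmed by the article).
    """
    if len(a) != len(b):
        raise ValueError("both variants must cover the same tasks")
    rng = random.Random(seed)
    diffs = [int(x) - int(y) for x, y in zip(a, b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy outcomes (made up, not the experiment's data): 100 tasks per variant.
fem = [True] * 60 + [False] * 40
masc = [True] * 66 + [False] * 34
lo, hi = paired_bootstrap_ci(fem, masc, n_resamples=2_000)
diff = pass_at_1(fem) - pass_at_1(masc)
print(f"pass@1 difference = {diff:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

An interval that lies entirely below zero indicates that the first variant scored lower than the second by more than resampling noise would explain.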

The effect on GPT-5.4 mini was quite clear. Neutral formulations gave a pass@1 of approximately 0.661–0.663, "masculine" ones 0.660–0.668, and "feminine" ones 0.649–0.652.

After the two prompt variants were combined, the difference between the feminine and masculine frames had a confidence interval from -0.0265 to -0.0005, meaning it did not include zero.

In other words, the drop is small but unlikely to be due to chance. The most interesting part emerged in the breakdown by difficulty: on easy and medium tasks there was almost no significant effect, but on hard tasks the difference between the "feminine" and "masculine" frames was -0.0314, with a confidence interval from -0.0600 to -0.0043. Across platforms, no notable divergence was found. On more recent tasks there was a trend toward a larger gap, though it proved less robust than the difficulty breakdown.
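The difficulty breakdown amounts to computing the same pass@1 difference within each stratum. The sketch below assumes a hypothetical record layout of `(difficulty, solved_feminine, solved_masculine)` per task; the records themselves are made up and are not the experiment's data.

```python
from collections import defaultdict


def diff_by_difficulty(records):
    """Return {difficulty: pass@1(feminine) - pass@1(masculine)}.

    records is an iterable of (difficulty, solved_feminine, solved_masculine)
    tuples, one per task (hypothetical layout).
    """
    buckets = defaultdict(list)
    for difficulty, fem_ok, masc_ok in records:
        # Per-task difference: -1, 0, or +1.
        buckets[difficulty].append(int(fem_ok) - int(masc_ok))
    return {d: sum(v) / len(v) for d, v in buckets.items()}


# Toy records for illustration only.
records = [
    ("easy", True, True),
    ("easy", True, True),
    ("medium", False, False),
    ("medium", True, True),
    ("hard", False, True),
    ("hard", True, True),
]
print(diff_by_difficulty(records))
```

Each stratum would then get its own confidence interval, which is why the smaller hard-task subset produces a wider interval than the full set.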

With the flagship GPT-5.4, the picture was different. Because of cost and runtime, it was tested only on hard tasks, and the effect was not reproduced there.

The likely explanation is that the stronger model solves this set considerably better than the mini version (approximately 57 percent versus 33 percent), so for it the benchmark no longer sits at the boundary of its capabilities. In other words, sensitivity to phrasing may show up precisely when a model is working at its limits rather than in its comfort zone. This is an important limitation: it cannot yet be claimed that the effect is a universal property of all versions of ChatGPT, or of LLMs in general.

The practical conclusion from this experiment is rather straightforward. When it comes to complex tasks where each attempt matters and the model might stumble on minor details, it is safer to formulate requests neutrally and not add unnecessary personal framing. This is not proof of "sexism" in the colloquial sense, but rather a signal that even minimal language markers can influence answer quality in measurable scenarios. The next logical step is to test other models, other languages, and more challenging datasets to understand where the peculiarity of the specific benchmark ends and a systemic problem begins.
