
Google added Flex and Priority modes to Gemini API for price-reliability balance


AI-processed from Google AI Blog; edited by Hamidun News

On April 2, 2026, Google added two new service tiers to the Gemini API, Flex and Priority, which let developers manage cost, latency, and reliability more precisely without complicating their architecture. Background and critical user requests can now be routed to different service levels through the same synchronous interface, rather than through separate pipelines for the Standard API and the Batch API.

The company describes the problem in practical terms. As AI use cases move away from simple chatbots toward agents and compound workflows, teams typically face two classes of workload. The first is background tasks: bulk data enrichment, long-running model reasoning, research runs, CRM updates, and other processes where a few extra seconds are not critical. The second is interactive requests: user chats, copilots, real-time moderation, support bots, and other features where a stable response and predictable latency matter.

Previously, this split often meant combining regular synchronous requests on the product side with the Batch API for cheap background processing. That saved money but added overhead: teams had to manage asynchronous jobs, input and output files, and polling for job status. Google says Flex and Priority close this gap: both options work through standard synchronous endpoints, and switching tiers is done via the service_tier parameter in the request.
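
A minimal sketch of that routing, assuming the tier is passed as a string field on an otherwise unchanged request body. Only the service_tier parameter name comes from Google's announcement; the tier values ("flex", "standard", "priority"), the Workload enum, and the payload layout are illustrative assumptions, not the documented request schema.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical workload classes used only for this illustration."""
    BACKGROUND = "background"    # enrichment jobs, research runs, CRM updates
    DEFAULT = "default"          # ordinary traffic
    INTERACTIVE = "interactive"  # chats, copilots, real-time moderation


# Assumed tier strings; check the Gemini API reference for the real values.
TIER_BY_WORKLOAD = {
    Workload.BACKGROUND: "flex",
    Workload.DEFAULT: "standard",
    Workload.INTERACTIVE: "priority",
}


def build_request(prompt: str, workload: Workload) -> dict:
    """Build one synchronous request body; only the tier field varies."""
    return {
        # Assumed payload layout, not the documented schema.
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": TIER_BY_WORKLOAD[workload],
    }


if __name__ == "__main__":
    print(build_request("Summarize yesterday's CRM changes.", Workload.BACKGROUND))
    print(build_request("Where is my order?", Workload.INTERACTIVE))
```

The point of the design is that both calls share one code path: the caller picks a workload class, and nothing else about the integration changes.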

Flex is a new economical mode for tasks that can tolerate latency and lower execution priority. Google promises savings of up to 50% compared to Standard API if the developer is willing to sacrifice some reliability and response speed for cost. The key point is that Flex does not turn work into a separate batch process: it is still a synchronous request with a familiar integration pattern.
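
For a rough sense of what the discount could mean for a budget, here is a back-of-the-envelope calculation. The monthly spend and the share of traffic moved to Flex are made-up inputs, and 50% is the ceiling Google quotes ("up to"), so the result is a best case rather than a forecast.

```python
def blended_cost(standard_cost: float, flex_share: float,
                 flex_discount: float = 0.5) -> float:
    """Return total cost after moving `flex_share` of spend to Flex.

    flex_discount=0.5 is the "up to 50%" ceiling from the announcement;
    actual savings may be lower.
    """
    if not 0.0 <= flex_share <= 1.0:
        raise ValueError("flex_share must be between 0 and 1")
    if not 0.0 <= flex_discount <= 1.0:
        raise ValueError("flex_discount must be between 0 and 1")
    stays_standard = standard_cost * (1.0 - flex_share)
    moved_to_flex = standard_cost * flex_share * (1.0 - flex_discount)
    return stays_standard + moved_to_flex


if __name__ == "__main__":
    # Hypothetical: 10,000 per month on Standard, 60% of it background work.
    print(round(blended_cost(10_000.0, 0.6), 2))
```

With these hypothetical inputs, the bill drops by at most 30%, because only the background share of the traffic gets the discount.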

The company suggests using this mode for background CRM updates, large-scale research simulations, and agent scenarios where the model can "think" or "review" information in the background. According to Google, Flex will be available on all paid tiers and is supported in GenerateContent and Interactions API requests.

Priority, by contrast, is designed for the most sensitive traffic. It is a premium mode with the highest level of assurance, intended to help applications get through peak loads without critical requests being displaced. Google states directly that such requests receive the highest criticality level, which means a better chance of stable operation even when the platform is under load. Another important detail is the graceful degradation mechanism: if an application exceeds its Priority limits, the excess requests do not fail with an error but are automatically handled at the Standard level.

For production, this fallback may matter more than the SLA itself, because it reduces the risk of a feature failing outright during traffic spikes. Google is also making Priority more transparent from an operational and billing perspective: the API response will indicate which tier actually handled each request, so teams can analyze system behavior, calculate costs, and track real downgrade events.
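
One way to use that signal is to compare, per request, the tier that was asked for with the tier that actually served it. The sketch below works on plain dictionaries; the field names requested_tier and served_tier and the tier strings are placeholders, since the announcement does not say what the response field is called.

```python
from collections import Counter


def downgrade_rate(records: list[dict]) -> tuple[Counter, float]:
    """Count served tiers and the share of Priority requests served elsewhere.

    Each record is a hypothetical log entry with two keys:
    "requested_tier" and "served_tier".
    """
    served = Counter(r["served_tier"] for r in records)
    priority = [r for r in records if r["requested_tier"] == "priority"]
    if not priority:
        return served, 0.0
    downgraded = sum(1 for r in priority if r["served_tier"] != "priority")
    return served, downgraded / len(priority)


if __name__ == "__main__":
    log = [
        {"requested_tier": "priority", "served_tier": "priority"},
        {"requested_tier": "priority", "served_tier": "standard"},
        {"requested_tier": "flex", "served_tier": "flex"},
    ]
    counts, rate = downgrade_rate(log)
    print(dict(counts), rate)
```

A rising downgrade share would suggest the application is regularly exceeding its Priority limits, which is worth knowing before users notice slower responses.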

Among typical use cases, the company names real-time support bots, live-moderation pipelines, and other latency-sensitive requests. At launch, Priority will be available for paid projects at Tier 2 and Tier 3 in the GenerateContent API and the Interactions API.

For developers, this update matters for more than pricing. Google is essentially trying to simplify the engineering choice between "cheap" and "reliable" without forcing product teams to build two different integration models. If Flex truly delivers the promised savings of up to 50% on background tasks without a switch to batch architecture, it could reduce the cost of agent scenarios and bulk pipelines. And if Priority consistently keeps critical traffic stable during peak hours, the Gemini API will have a stronger case for consumer products where outages directly affect revenue and user experience.

The main takeaway is simple: Google is turning the Gemini API from a single standard channel into a more flexible system of service classes. For teams, this means the ability to deliberately split background and critical workloads on the same API, calculate unit economics more accurately, and handle peak periods more easily. If the approach takes hold, AI platforms will increasingly compete not only on model quality but also on how finely a provider can package performance, reliability, and cost for different product scenarios.

