BentoML showed how to turn Grounding DINO into a production service with a web API

Hamidun News Editorial

AI monitoring · Habr AI

May 2, 2026 · 3 min

AI-processed from Habr AI; edited by Hamidun News

BentoML demonstrated a practical scenario for taking Grounding DINO from a notebook into a production service without heavy MLOps overhead. Using a zero-shot object detector as an example, the author built an HTTP API, added parameter validation, and showed how to run the service both locally and in Docker.

Why BentoML Here

The article's main idea is straightforward: training a model is not enough; you also need to deliver it to users reliably. A production scenario needs storage and versioning for model weights, resource control, a convenient API, and a clear deployment path. BentoML covers exactly this layer: the framework lets you wrap a model's Python code in a service, prepare the environment automatically, build a Docker image, and get an HTTP interface with Swagger UI out of the box.

For teams that don't want to build the entire MLOps stack manually, this significantly shortens the path from experiment to a working service.

The example uses Grounding DINO, a model for open-set object detection. Unlike classical detectors, it takes a text prompt in addition to the image. You can supply a picture and a list of descriptions such as "a cat" or "a remote control," and the model will try to find exactly those objects, even if they were not among a fixed set of classes defined in advance. This makes it a good fit for a service example: there is inference, there are text parameters, and there is a visual result that is easy to return via an API.
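As a small illustration of the text-prompt side, the sketch below turns a list of descriptions into a single query string. It assumes the convention given on the Hugging Face model card for Grounding DINO, where queries are lowercased and each phrase ends with a period. The helper name build_prompt is hypothetical and does not come from the article.

```python
def build_prompt(labels: list[str]) -> str:
    """Join object descriptions into one Grounding DINO text query."""
    # Lowercase each phrase and strip surrounding whitespace and trailing periods.
    phrases = [label.strip().lower().rstrip(".").strip() for label in labels]
    # Drop entries that ended up empty.
    phrases = [phrase for phrase in phrases if phrase]
    if not phrases:
        raise ValueError("at least one non-empty description is required")
    # Separate phrases with a period and terminate the whole query with one.
    return ". ".join(phrases) + "."
```

Calling build_prompt(["A cat", "a remote control"]) produces a query in the form the model card recommends, which can then be passed to the service as the prompt parameter.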

How to Build the Service

The author starts with a typical development example using Transformers: grounding-dino-tiny is loaded, the image and text prompt pass through AutoProcessor, and the model returns bounding boxes, confidence scores, and text labels. This code is then moved into a GroundingDinoService class, which is decorated as a BentoML service.

An important detail: model weights are specified as HuggingFaceModel at the class level so that BentoML downloads them in advance when building the artifact, not during container initialization. This avoids the typical error of a missing model at startup.

The runtime configuration specifies Python 3.11 and dependencies via pyproject.toml
The detect_image and render methods are published as HTTP endpoints using BentoML decorators
Input parameters are described using Pydantic: prompt, box_threshold, and text_threshold
The result can be returned as JSON with a list of boxes or as a ready image with annotations

"All this is done in a single Python file in less than 100 lines."

In practice, the service boils down to three parts: a private _detect method with the main inference logic, a public detect_image method for structured responses, and render for visualization. This arrangement is convenient because the same logic serves both machine clients and people who want to quickly test the model through the interface or curl. In addition, Pydantic validation rejects invalid parameters before the model is called.

Launch and API

For local development, the bentoml serve command is used with port 3025 and automatic reload enabled. Once started, the service exposes Swagger UI, where you can upload an image, pass JSON parameters, and check the response right away. This is a convenient debugging mode: you edit the code, save the file, and see the service's behavior change without a rebuild. For many teams, this fast development loop matters more than a maximally optimized runtime at the outset.

The production version is built via bentoml build and then containerized with the bentoml containerize command. After that, the service can be run in Docker with GPU and opened on the desired port.
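The launch steps in this section can be collected in one place. The sketch below only assembles the argument lists and prints them; it does not run anything. The module path service:GroundingDinoService, the Bento tag, the image tag, and the host port mapping are assumptions for illustration. Port 3000 is BentoML's default serving port inside the container.

```python
import shlex

SERVICE_TARGET = "service:GroundingDinoService"  # assumed file and class name
BENTO_TAG = "grounding_dino_service:latest"      # hypothetical Bento tag
IMAGE_TAG = "grounding-dino:demo"                # hypothetical Docker image tag


def dev_command(port: int = 3025) -> list[str]:
    """Local development server with automatic reload."""
    return ["bentoml", "serve", SERVICE_TARGET, "--port", str(port), "--reload"]


def release_commands(host_port: int = 3025) -> list[list[str]]:
    """Build the Bento, containerize it, and run the image with GPU access."""
    return [
        ["bentoml", "build"],
        # -t sets an explicit image tag so the docker run step can refer to it.
        ["bentoml", "containerize", BENTO_TAG, "-t", IMAGE_TAG],
        ["docker", "run", "--rm", "--gpus", "all", "-p", f"{host_port}:3000", IMAGE_TAG],
    ]


if __name__ == "__main__":
    for argv in [dev_command(), *release_commands()]:
        print(shlex.join(argv))
```

Keeping the commands as argument lists makes them easy to hand to subprocess.run later, for example from a CI script, without worrying about shell quoting.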

In the demonstration, the service has two endpoints: /detect_image returns JSON with coordinates, classes, and confidence scores, while /render saves and returns an image with drawn boxes. The article also shows calls via the BentoML SDK client and via curl, so the service is equally easy to connect to internal pipelines and to external applications.
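A call through the BentoML SDK client might look like the sketch below. It assumes the service listens on a URL such as http://localhost:3025, takes its inputs as image and params, and returns a list of dictionaries with label, score, and box keys; none of these names are confirmed by the article. The best_per_label helper is hypothetical and only shows one way a pipeline could consume the JSON.

```python
from pathlib import Path


def detect_via_sdk(
    base_url: str,
    image_path: str,
    prompt: str,
    box_threshold: float = 0.4,
    text_threshold: float = 0.3,
) -> list[dict]:
    """Call the detect_image endpoint through the BentoML HTTP client."""
    import bentoml  # imported lazily so the helper below works without it

    with bentoml.SyncHTTPClient(base_url) as client:
        # Endpoint methods are exposed on the client under the API's name.
        return client.detect_image(
            image=Path(image_path),
            params={
                "prompt": prompt,
                "box_threshold": box_threshold,
                "text_threshold": text_threshold,
            },
        )


def best_per_label(detections: list[dict]) -> dict[str, dict]:
    """Keep only the highest-scoring detection for each label."""
    best: dict[str, dict] = {}
    for det in detections:
        current = best.get(det["label"])
        if current is None or det["score"] > current["score"]:
            best[det["label"]] = det
    return best
```

With the service running, something like best_per_label(detect_via_sdk("http://localhost:3025", "cats.jpg", "a cat. a remote control.")) would give one box per requested object; the file name here is a placeholder.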

What It Means

The material is useful because it grounds the conversation about model serving: instead of abstract MLOps architecture, it shows a short and reproducible path from a Python script with Grounding DINO to a container with an HTTP API. For small ML teams, this is a good template when you need to deliver a vision model to production quickly without getting bogged down in infrastructure on the first release.
