BentoML showed how to turn Grounding DINO into a production service with a web API
AI-processed from Habr AI; edited by Hamidun News
BentoML demonstrated a practical scenario for taking Grounding DINO from a notebook into a production service without heavy MLOps overhead. Using a zero-shot object detector as an example, the author built an HTTP API, added parameter validation, and showed how to run the service both locally and in Docker.
Why BentoML Here
The article's main idea is straightforward: training a model is not enough, because you also need to deliver it properly to users. A production scenario requires model weight storage and versioning, resource control, a convenient API, and a clear deployment path. BentoML covers exactly this layer. The framework lets you wrap a model's Python code into a service, prepare the environment automatically, build a Docker image, and get an HTTP interface with Swagger UI out of the box.
For teams that don't want to build the entire MLOps stack manually, this significantly shortens the path from experiment to a working service.
The example uses Grounding DINO, a model for open-set object detection. Unlike classical detectors, it takes a text prompt in addition to the image. You can provide a picture and a list of descriptions like "a cat" or "a remote control", and the model will try to find exactly those objects, even if they were never defined as fixed classes. This makes it a good fit for a service scenario: there is inference, there are text parameters, and there is a visual result that is easy to return via an API.
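As a small illustration of the text side of the input: the Hugging Face documentation for Grounding DINO recommends lowercase queries in which each phrase ends with a period. The sketch below joins a list of descriptions into one such prompt. The helper name `build_prompt` is hypothetical and is not taken from the original article.

```python
def build_prompt(labels: list[str]) -> str:
    """Join free-form descriptions into a single Grounding DINO text query."""
    cleaned = []
    for label in labels:
        # Lowercase, then drop surrounding whitespace and trailing periods.
        text = label.strip().lower().rstrip(".").strip()
        if text:
            cleaned.append(text)
    if not cleaned:
        raise ValueError("at least one non-empty label is required")
    # Every phrase ends with a period; phrases are separated by a space.
    return " ".join(f"{text}." for text in cleaned)


if __name__ == "__main__":
    print(build_prompt(["A cat", "a remote control"]))
```

Normalizing the prompt in one place keeps the API tolerant of inputs such as "A cat" while the model always receives text in the form it expects.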
How to Build the Service
The author starts with a typical dev example using Transformers: grounding-dino-tiny is loaded, an image passes through AutoProcessor, then the model returns bounding boxes, confidence, and text labels. Next, this code is moved into a GroundingDinoService class, which is decorated as a BentoML service.
An important detail: model weights are specified as HuggingFaceModel at the class level so that BentoML downloads them in advance when building the artifact, not during container initialization. This avoids the typical error of a missing model at startup.
- The runtime configuration specifies Python 3.11 and dependencies via pyproject.toml
- The detect_image and render methods are published as HTTP endpoints using BentoML decorators
- Input parameters are described using Pydantic: prompt, box_threshold, and text_threshold
- The result can be returned as JSON with a list of boxes or as a ready image with annotations
"All this is done in a single Python file in less than 100 lines."
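To make the parameter validation concrete, here is a minimal sketch of the idea. The article uses Pydantic; this version deliberately swaps in a standard-library dataclass so it runs without extra dependencies. The field names prompt, box_threshold, and text_threshold come from the article, while the class name `DetectParams`, the helper `parse_params`, the default values, and the [0, 1] range check are assumptions made for illustration.

```python
from dataclasses import dataclass

ALLOWED_KEYS = {"prompt", "box_threshold", "text_threshold"}


@dataclass(frozen=True)
class DetectParams:
    """Stdlib stand-in for the Pydantic model described in the article."""

    prompt: str
    box_threshold: float = 0.4   # assumed default, not from the article
    text_threshold: float = 0.3  # assumed default, not from the article

    def __post_init__(self) -> None:
        if not isinstance(self.prompt, str) or not self.prompt.strip():
            raise ValueError("prompt must be a non-empty string")
        for name in ("box_threshold", "text_threshold"):
            value = getattr(self, name)
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"{name} must be a number")
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def parse_params(payload: dict) -> DetectParams:
    """Reject unknown keys and invalid values before inference runs."""
    unknown = set(payload) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if "prompt" not in payload:
        raise ValueError("prompt is required")
    return DetectParams(**payload)


if __name__ == "__main__":
    print(parse_params({"prompt": "a cat.", "box_threshold": 0.5}))
```

With Pydantic the same constraints would be declared on the model fields, and the framework would turn a failed check into an HTTP error response instead of a Python exception.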
In practice, the service boils down to three parts: a private _detect method with the main inference logic, a public detect_image method for structured responses, and render for visualization. This arrangement is convenient because the same logic serves both machine clients and people who want to quickly test the model through the interface or curl. In addition, Pydantic validation rejects incorrect parameters before the model is ever called.
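The three-part layout can be sketched in plain Python. This is not the author's code: the BentoML decorators and the real model are left out, and a fake detector is injected so the example runs anywhere. The method names _detect, detect_image, and render follow the article; `Detection`, `DetectionFacade`, `fake_detector`, the response shape, and the default threshold are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class Detection:
    label: str
    score: float
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


# A detector takes (image, prompt) and returns unfiltered detections.
Detector = Callable[[object, str], list[Detection]]


class DetectionFacade:
    """Plain-Python stand-in for the service class (no BentoML decorators)."""

    def __init__(self, detector: Detector) -> None:
        self._detector = detector

    def _detect(self, image: object, prompt: str, box_threshold: float) -> list[Detection]:
        # The single place where inference and score filtering happen.
        raw = self._detector(image, prompt)
        return [d for d in raw if d.score >= box_threshold]

    def detect_image(self, image: object, prompt: str, box_threshold: float = 0.4) -> dict:
        # Structured response for machine clients.
        found = self._detect(image, prompt, box_threshold)
        return {"detections": [asdict(d) for d in found]}

    def render(self, image: object, prompt: str, box_threshold: float = 0.4) -> list[str]:
        # Stand-in for drawing: one caption per box instead of an annotated image.
        found = self._detect(image, prompt, box_threshold)
        return [f"{d.label} {d.score:.2f} @ {d.box}" for d in found]


def fake_detector(image: object, prompt: str) -> list[Detection]:
    # Hard-coded output that imitates a model response.
    return [
        Detection("a cat", 0.91, (10.0, 20.0, 110.0, 220.0)),
        Detection("a remote control", 0.22, (5.0, 5.0, 40.0, 25.0)),
    ]


if __name__ == "__main__":
    facade = DetectionFacade(fake_detector)
    print(facade.detect_image(None, "a cat. a remote control."))
    print(facade.render(None, "a cat. a remote control."))
```

Because both public methods go through _detect, a change to thresholds or post-processing shows up identically in the JSON and in the rendered output.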
Launch and API
For local development, the author uses the bentoml serve command with port 3025 and automatic reload enabled. After startup, the service exposes Swagger UI, where you can upload an image, pass JSON parameters, and check the response right away. This makes for a convenient debugging loop: you edit the code, save the file, and see the service's behavior change without rebuilding. For many teams, this fast development cycle matters more than a fully optimized runtime from the start.
The production version is built via bentoml build and then containerized with the bentoml containerize command. After that, the service can be run in Docker with GPU and opened on the desired port.
In the demonstration, the service has two endpoints: /detect_image returns JSON with coordinates, classes, and confidence, while /render saves and returns an image with drawn boxes. The article also shows calls via the BentoML SDK client and via curl, meaning the service is equally convenient to connect both to internal pipelines and external applications.
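For readers who want to call the JSON endpoint without the BentoML SDK or curl, the sketch below builds a multipart POST request with the standard library only. The path /detect_image and port 3025 come from the article; the form field names `image` and `params`, the JSON encoding of `params`, and the helper names are assumptions that should be checked against the schema shown in the service's Swagger UI.

```python
import json
import urllib.request
import uuid


def encode_multipart(
    fields: dict[str, str],
    files: dict[str, tuple[str, bytes, str]],
) -> tuple[bytes, str]:
    """Build a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode("utf-8"),
        ]
    for name, (filename, content, mime) in files.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode(),
            f"Content-Type: {mime}".encode(),
            b"",
            content,
        ]
    # Closing boundary, followed by a final CRLF.
    parts += [f"--{boundary}--".encode(), b""]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def build_detect_request(base_url: str, image_bytes: bytes, params: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request to the /detect_image endpoint."""
    body, content_type = encode_multipart(
        fields={"params": json.dumps(params)},  # assumed field name
        files={"image": ("image.jpg", image_bytes, "image/jpeg")},  # assumed field name
    )
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/detect_image",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


if __name__ == "__main__":
    request = build_detect_request(
        "http://localhost:3025", b"\xff\xd8\xff", {"prompt": "a cat."}
    )
    print(request.get_method(), request.full_url)
    # To actually send it while the service is running:
    # with urllib.request.urlopen(request) as response:
    #     print(json.loads(response.read()))
```

The request is only constructed here, so the example runs without a live service; sending it requires the service to be up and the field names to match its real input schema.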
What It Means
The material is useful because it grounds the conversation about model serving: instead of abstract MLOps architecture, it shows a short and reproducible path from a Python script with Grounding DINO to a container with an HTTP API. For small ML teams, this is a good template when you need to deliver a vision model to production quickly without getting bogged down in infrastructure on the first release.