How to Run DeepSeek on Your Server: Memory, Config, and Complete Privacy
Tired of trusting your queries to public APIs? It's time to deploy DeepSeek on your own cloud server. The 7B model in Q4 format requires just 6-8 GB VRAM…
AI-processed from Habr AI; edited by Hamidun News
Public LLM services are convenient, but they have a systemic flaw — you don't know what happens to your data. DeepSeek can be deployed on your own cloud server: conversations remain only with you, regional bans don't exist, and price predictability becomes the norm.
Why move to your own server
The problem with public APIs goes beyond cost. Even on paid tiers, you're essentially sending requests to someone else's infrastructure without privacy guarantees. Alibaba, OpenAI, and other vendors have their own data usage policies — and you have no control over what happens to your data on their servers. Some companies explicitly include in their Terms of Service the right to use conversations for further training if not explicitly disabled. Add strict censorship in some models — where responses to perfectly valid requests are unavailable without explanation — and sudden geo-blocking, and you get a business process tied to a public API that becomes vulnerable to external vendor decisions.
Self-hosted solves all these problems:
- Privacy: requests never leave your infrastructure
- No censorship: the model operates without external content restrictions
- No regional blocks: service is accessible from any country
- Predictable costs: pay only for the cloud instance, no surprise rate changes
- Full control: fine-tuning, retraining, integration into your own products
Memory and hardware requirements
The key parameter when choosing a configuration is VRAM volume. It depends on model size and quantization level. DeepSeek-R1 in the 7B variant in Q4 format requires around 6-8 GB VRAM — even a budget cloud GPU can handle this.
The 14B version takes 10-12 GB, 32B — about 20 GB. Full DeepSeek-V3 with 685 billion parameters in 8-bit quantization will require hundreds of gigabytes — that's already GPU cluster territory. For most working scenarios, 7B or 14B variants are optimal: a reasonable balance between answer quality and infrastructure cost.
Running on CPU is possible, but significantly slower — the recommended minimum is 32 GB RAM and fast NVMe storage. Q4 quantization reduces memory requirements by approximately half compared to FP16 with minimal quality loss.
How to set up and configure
The simplest path is Ollama. It installs in one command on Linux, macOS, or Windows; the model downloads via `ollama pull deepseek-r1:7b`. The service automatically launches a REST API on port 11434 with an OpenAI-compatible interface — Open WebUI, Cursor, n8n, and most popular clients connect to it without additional configuration. For production with high loads, vLLM is better suited: it supports batching, parallel requests, and multiple GPUs simultaneously. llama.cpp provides maximum flexibility — works on any platform, supports all GGUF quantization formats, and consumes minimal resources. Both options provide an OpenAI-compatible API.
Several parameters are critical from the first launch:
- `context_length` — set it for your tasks; the default value is often insufficient for long conversations
- `num_threads` — for CPU mode, set it equal to the number of physical cores, not logical ones
- `gpu_layers` — number of model layers offloaded to GPU; requires experimental tuning
- `temperature` and `top_p` — affect answer determinism, important for production
"The main advantage of self-hosted LLM is predictability.
No surprises with access being cut off, unexpected policy changes, or censorship in the next update."
What this means
Self-hosting LLM is no longer the domain of enthusiasts. Deploying DeepSeek on a cloud server today is a task for several hours even without deep DevOps experience. For companies working with confidential data, it's no longer an alternative to public APIs — it's a practical necessity.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.