Infrastructure sizing and architecture
We select GPUs, servers, clustering, high availability and disaster recovery. We work with, among others, NVIDIA H100, H200, A100 and L40, and Lenovo ThinkSystem GPU servers.
We deploy local language models - such as Llama, Mistral, Qwen, DeepSeek or Phi - in the client’s infrastructure. This is a solution for organisations that need full control over their data, reduced dependence on the public cloud, and the ability to run AI in an isolated environment.
Critical data, documents, prompts and answers can stay within the organisation’s infrastructure. This reduces the risk of information being sent uncontrolled outside the organisation’s own environment and allows the AI architecture to be better matched to security and compliance requirements.
The organisation decides which model is used, in which version, on which infrastructure, and with what constraints. It is possible to manage model versions, inference configuration, security parameters, quality monitoring and the way the environment is updated.
In an on-premise model, cost is based mainly on infrastructure, maintenance and environment development, rather than solely on token-based billing. At high query volumes and over a long usage horizon this can offer greater cost predictability than token-billed services.
LLM on-premise helps meet the requirements of organisations that must exercise particular control over data processing: banks, healthcare entities, the public sector, defence and critical infrastructure. The architecture can include environment separation, access control, usage logging, an audit trail, and no dependency on a public API.
We select GPUs, servers, clustering, high availability and disaster recovery. We work with, among others, NVIDIA H100, H200, A100 and L40, and Lenovo ThinkSystem GPU servers.
We deploy Llama, Mistral, Qwen, DeepSeek and Phi in variants matched to performance and cost requirements. This can include 4-bit or 8-bit quantisation and serving via vLLM, TGI or Ollama.
We build RAG solutions based on local vector databases such as pgvector, Qdrant or Weaviate. Data, documents and embeddings stay within the same controlled environment as the model.
We deliver LoRA, QLoRA or full fine-tuning for scenarios where a generic model is not sufficient. This includes, among others, industry terminology, internal procedures and specific document classes.
We design monitoring of answer quality, latency, throughput and running cost. We implement model update processes, version control and an audit trail.
Before production launch we carry out an AI Security Review covering prompt-injection testing, tool sandboxing, retrieval validation, and securing the GPU infrastructure.
We start with an analysis of the use case. We check which model will be sufficient, what the requirements are for latency, throughput and SLA, and what data should be available to the model.
On this basis we prepare infrastructure sizing, a model recommendation and a solution architecture. We also determine whether the right path is RAG, fine-tuning, a hybrid model, or a classic open-weight model deployment.
We typically deliver the first MVP within an 8-12 week horizon. We measure answer quality, performance, query handling cost, and how well the model fits the business scenario. Before production launch we implement monitoring, model versioning, maintenance procedures and an AI Security Review.
Technology stack
The team’s experience in AI, enterprise infrastructure and on-premise environments confirms SNOK’s readiness to deliver private AI infrastructure projects.
Financial sector bank
LLM on-premise deployment for working with compliance documentation. The solution was based on a 70B model running on H100 GPU infrastructure and fine-tuned to the organisation’s internal procedures.
Public sector entity
A local LLM environment for working with classified data. The project included an isolated architecture, access control, and no external data egress.
Critical infrastructure operator
An LLM assistant for OT/SCADA teams, supporting work with technical documentation and NIS2 requirements. The solution was designed with access control, data security and on-premise environment requirements in mind.
For many enterprise use cases, local models can be sufficient, particularly when they are well chosen, run on suitable infrastructure, and connected to the organisation’s knowledge through RAG or fine-tuning. For the most demanding tasks, cloud models may still have an edge, but the quality gap keeps narrowing across many scenarios.
This depends on the model, expected throughput, latency and deployment mode. Llama 3.3 8B can run on a single A100 or L40 class GPU. Llama 3.3 70B in 4-bit quantisation may need 1x H100 80GB or 2x A100. A full 70B FP16 model may require a cluster of 4x H100. SNOK carries out sizing before any infrastructure purchase.
At low scale, LLM on-premise will usually be more expensive because of infrastructure CAPEX. At high scale, millions of tokens a day, and over a long usage horizon, it can be more cost-effective. An additional benefit is control over data and compliance with requirements that public AI services may not meet.
Some providers, including Anthropic and OpenAI, do not offer their models for classic on-premise deployment. Open-weight models such as Llama, Mistral, Qwen, DeepSeek and Phi are available instead. Scenarios with Azure OpenAI in a private Azure tenant are also possible, if the organisation’s security model allows it.