💡Model Inference Server
What about it?
2 min read · Jan 29, 2022
- Models are getting bigger: model sizes now range from MBs to TBs. Innovation in deep learning is rapid, and more and more use cases have models in their execution path.
- Model execution benefits greatly from heterogeneous hardware backends (accelerators): CPUs, GPUs, FPGAs.
- Models are used in critical services and at large scale (even millions of QPS): model deployment is a multi-tenant problem, and routing requests efficiently to servers is a necessity.
Keeping these in mind at production scale, there is a genuine need for a model inference server. It must support fast model validation, deployment, and proper version control.
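To make "validation, deployment, and version control" concrete, here is a minimal Python sketch of a versioned model registry; the `ModelRegistry` class, its method names, and the run-once validation step are illustrative assumptions, not the API of any particular serving system.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class ModelVersion:
    version: int
    checksum: str      # integrity check over the uploaded artifact
    predict: Callable  # callable that runs inference for this version

@dataclass
class ModelRegistry:
    """Hypothetical registry: validates an artifact, assigns a version, serves lookups."""
    models: Dict[str, Dict[int, ModelVersion]] = field(default_factory=dict)

    def deploy(self, name: str, artifact: bytes, predict: Callable, sample_input) -> int:
        # Fast validation: run the model once on a sample input before accepting it.
        try:
            predict(sample_input)
        except Exception as exc:
            raise ValueError(f"validation failed for {name}: {exc}")
        versions = self.models.setdefault(name, {})
        version = max(versions, default=0) + 1
        versions[version] = ModelVersion(
            version=version,
            checksum=hashlib.sha256(artifact).hexdigest(),
            predict=predict,
        )
        return version

    def get(self, name: str, version: Optional[int] = None) -> ModelVersion:
        versions = self.models[name]
        return versions[version if version is not None else max(versions)]

# Usage: deploy two versions of a toy model and pin a request to version 1.
registry = ModelRegistry()
registry.deploy("ctr_model", b"weights-v1", lambda x: 0.1 * x, sample_input=1.0)
registry.deploy("ctr_model", b"weights-v2", lambda x: 0.2 * x, sample_input=1.0)
print(registry.get("ctr_model", version=1).predict(3.0))  # uses v1
print(registry.get("ctr_model").predict(3.0))             # latest (v2)
```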
Architecture & component responsibilities:

- Model Master: a singleton orchestrator that deploys models to one or more model servers, factoring in model requirements and hardware resources. A large ads model may not fit on a single model server; in that case the model is transformed (split) and deployed across multiple model servers (a placement sketch follows below). The Model Master also performs fast model validation and version control.
- Model Server: consists of a Model Loader, model containers, and an inference request executor.
By isolating model instances in containers, you ensure that they don’t interfere with each other.
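A toy sketch of the placement decision the Model Master makes: given a model's memory footprint and each server's free memory, place it on one server if it fits, otherwise split it into shards across several. The greedy strategy and the names (`Server`, `place_model`) are illustrative assumptions, not the actual algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Server:
    name: str
    free_gb: float  # memory still available on this model server

def place_model(model_gb: float, servers: List[Server]) -> Dict[str, float]:
    """Return {server_name: GB assigned}. Splits the model if no single server fits it."""
    # Case 1: the whole model fits on one server.
    for s in sorted(servers, key=lambda s: s.free_gb, reverse=True):
        if s.free_gb >= model_gb:
            s.free_gb -= model_gb
            return {s.name: model_gb}
    # Case 2: split (transform) the model greedily across servers with spare memory.
    placement, remaining = {}, model_gb
    for s in sorted(servers, key=lambda s: s.free_gb, reverse=True):
        if remaining <= 0:
            break
        shard = min(s.free_gb, remaining)
        if shard > 0:
            placement[s.name] = shard
            s.free_gb -= shard
            remaining -= shard
    if remaining > 0:
        raise RuntimeError("not enough aggregate memory across model servers")
    return placement

# Example: a 300 GB ads model spread over servers with 256 GB and 128 GB free.
servers = [Server("ms-1", 256.0), Server("ms-2", 128.0)]
print(place_model(300.0, servers))  # {'ms-1': 256.0, 'ms-2': 44.0}
```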