Optimize Throughput and Latency
The NuPIC Inference Server offers flexible model performance tuning options through the model configuration file. There are typically two use cases: optimizing for throughput or optimizing for latency.
Optimizing for throughput
For better throughput, we recommend the following configurations:
- Setting the `instance_group.count` property to the number of cores in the deployment server. This property controls how many model instances run concurrently.
- Setting the environment variable `CAJAL_NUM_THREADS=0`. The `CAJAL_NUM_THREADS` option, available at server startup time, dictates the number of threads that each deployed model can use. The default value of 0 means that each model runs with a single thread and does not attempt multithreading. A value of 1, on the other hand, means that each model is multithreaded but has only a single thread, which creates unnecessary overhead, so avoid setting `CAJAL_NUM_THREADS=1`. Any value other than 1 offers considerably better performance. To adjust `CAJAL_NUM_THREADS`, edit the `docker run` command in the `nupic/nupic_inference.sh` script (see the sketch after this list), then stop and restart the server.
- Utilizing NUMA node optimization when applicable. On larger machines with more than one NUMA node (determined by running `lscpu`), further optimization can be achieved by confining each model instance to a single NUMA node. Follow these steps:
  i. Run `lscpu` and note the "NUMA" section.
  ii. If there is just 1 NUMA node, there is nothing to be done.
  iii. For each NUMA node, note the list of CPUs. Usually only the first set of CPUs in the list corresponds to the physical cores; use only those. For example, if it says `NUMA node0 CPU(s): 0-47,96-143`, use only CPUs 0-47.
  iv. For each NUMA node, edit the `scripts/triton.sh` script to add a `host-policy`. This consists of two additional parameters to the `tritonserver` command for each NUMA node: `--host-policy=numa0,numa-node=0 --host-policy=numa0,cpu-cores=0-47`.
  v. In the [config.pbtxt](doc:configuration) for the model, create a separate `instance_group` for each NUMA node and set its `host_policy` to that node's policy name. Make sure the `count` does not exceed each NUMA node's number of physical cores.
- Utilizing dynamic batching (a sample configuration is sketched below).
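As a concrete illustration of the `CAJAL_NUM_THREADS` setting, here is a minimal sketch of what the edited `docker run` command inside `nupic/nupic_inference.sh` might look like. The image name, port mappings, and volume paths below are placeholders, not the actual contents of the script; only the `-e CAJAL_NUM_THREADS=...` flag is the relevant change.

```shell
# Sketch only: the image name, ports, and volume paths are placeholders for
# whatever your nupic/nupic_inference.sh script already uses.
# CAJAL_NUM_THREADS=0 (the default) runs each model instance single-threaded,
# which is the recommended setting when optimizing for throughput.
docker run -d \
  -e CAJAL_NUM_THREADS=0 \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  <nupic-inference-image>
```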
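Dynamic batching is enabled in the same [config.pbtxt](doc:configuration) as the `instance_group` settings. The snippet below is a minimal sketch; the `max_queue_delay_microseconds` value is illustrative and should be tuned for your workload.

```
dynamic_batching {
  # Let the server wait briefly so incoming requests can be grouped into a batch.
  # The delay value here is illustrative, not a recommendation.
  max_queue_delay_microseconds: 100
}
```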
Putting everything together, let's go over an example of 96 cores with 2 NUMA nodes, such as an AWS `c7i.metal-48xl` EC2 instance. We want to maximize the overall throughput by running 96 instances, with 48 on each NUMA node. Here is the `instance_group` property in the model configuration file:
instance_group [
{
count: 48
kind: KIND_CPU
host_policy: "numa0"
},
{
count: 48
kind: KIND_CPU
host_policy: "numa1"
}
]
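To match that `instance_group` configuration, the `tritonserver` command in `scripts/triton.sh` needs one host policy per NUMA node. The sketch below assumes the second node's physical cores are 48-95 (the counterpart of `NUMA node0 CPU(s): 0-47,96-143` on this machine); confirm the actual ranges with `lscpu`, and keep whatever other flags your script already passes.

```shell
# Sketch only: other tritonserver flags (ports, logging, etc.) are omitted.
# One pair of --host-policy flags per NUMA node; CPU ranges come from lscpu.
tritonserver \
  --model-repository=/models \
  --host-policy=numa0,numa-node=0 --host-policy=numa0,cpu-cores=0-47 \
  --host-policy=numa1,numa-node=1 --host-policy=numa1,cpu-cores=48-95
```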
Optimizing for latency
To optimize for latency, we recommend setting the number of models equal to the number of cores divided by 4 (num_cores/4) and setting `CAJAL_NUM_THREADS=4`. On the same AWS instance, that translates to 12 models and `CAJAL_NUM_THREADS=4`.
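For reference, here is a minimal sketch of the corresponding `instance_group` for the latency-oriented setup; the count follows the numbers above. `CAJAL_NUM_THREADS=4` itself is set at server startup via the `docker run` command, as described in the throughput section. If you are also using NUMA host policies, split the count across the per-node `instance_group` entries instead.

```
instance_group [
  {
    # Fewer instances, each running with multiple worker threads
    # (CAJAL_NUM_THREADS=4), trades aggregate throughput for lower
    # per-request latency.
    count: 12
    kind: KIND_CPU
  }
]
```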
References:
- Model configuration - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md
- Model architecture - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md
- Model management - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md