Optimize Throughput and Latency
The NuPIC Inference Server offers flexible model performance tuning options through the model configuration file. There are typically two use cases: optimizing for throughput or optimizing for latency.
Optimizing for throughput
For better throughput, we recommend the following configurations:
- Setting the `instance_group.count` property to the number of cores in the deployment server. This property controls how many model instances run concurrently.
- Setting the environment variable `CAJAL_NUM_THREADS=0`. The `CAJAL_NUM_THREADS` option, available at server startup time, dictates the number of threads that each deployed model can use. The default value of 0 means that each model runs with a single thread and does not attempt multithreading. A value of 1, on the other hand, means that each model is multithreaded but has only a single thread, which creates unnecessary overhead, so avoid setting `CAJAL_NUM_THREADS=1`. Any value other than 1 offers considerably better performance. To adjust `CAJAL_NUM_THREADS`, edit the `docker run` command in the `nupic/nupic_inference.sh` script (see the sketch after this list), then stop and restart the server.
- Utilizing NUMA node optimization when applicable. On larger machines with more than one NUMA node (determined by running `lscpu`), further optimization can be achieved by confining each model instance to a single NUMA node. Follow these steps:
  i. Run `lscpu` and note the "NUMA" section.
  ii. If there is just 1 NUMA node, there is nothing to be done.
  iii. For each NUMA node, note the list of CPUs. Usually only the first set of CPUs in the list corresponds to the physical cores; use only those. For example, if it says `NUMA node0 CPU(s): 0-47,96-143`, use only CPUs 0-47.
  iv. For each NUMA node, edit the `scripts/triton.sh` script to add a `host-policy`. This consists of two additional parameters to the `tritonserver` command for each NUMA node: `--host-policy=numa0,numa-node=0 --host-policy=numa0,cpu-cores=0-47`.
  v. In the [config.pbtxt](doc:configuration) for the model, create a separate `instance_group` for each NUMA node and set its `host_policy` to that node's policy name. Make sure the `count` does not exceed each NUMA node's number of physical cores.
- Utilizing dynamic batching (a sample configuration is sketched below).
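As a concrete illustration of the `CAJAL_NUM_THREADS` setting, here is a minimal sketch of what the edited `docker run` command inside `nupic/nupic_inference.sh` might look like. The image name, port mappings, and volume paths below are placeholders, not the actual contents of the script; only the `-e CAJAL_NUM_THREADS=...` flag is the relevant change.

```shell
# Sketch only: the image name, ports, and volume paths are placeholders for
# whatever your nupic/nupic_inference.sh script already uses.
# CAJAL_NUM_THREADS=0 (the default) runs each model instance single-threaded,
# which is the recommended setting when optimizing for throughput.
docker run -d \
  -e CAJAL_NUM_THREADS=0 \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  <nupic-inference-image>
```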
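Dynamic batching is enabled in the same [config.pbtxt](doc:configuration) as the `instance_group` settings. The snippet below is a minimal sketch; the `max_queue_delay_microseconds` value is illustrative and should be tuned for your workload.

```
dynamic_batching {
  # Let the server wait briefly so incoming requests can be grouped into a batch.
  # The delay value here is illustrative, not a recommendation.
  max_queue_delay_microseconds: 100
}
```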
Putting everything together, let's go over an example of 96 cores with 2 NUMA nodes, such as an AWS `c7i.metal-48xl` EC2 instance. We want to maximize the overall throughput by running 96 instances, with 48 on each NUMA node. Here is the `instance_group` property in the model configuration file:
instance_group [
{
count: 48
kind: KIND_CPU
host_policy: "numa0"
},
{
count: 48
kind: KIND_CPU
host_policy: "numa1"
}
]
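To match that `instance_group` configuration, the `tritonserver` command in `scripts/triton.sh` needs one host policy per NUMA node. The sketch below assumes the second node's physical cores are 48-95 (the counterpart of `NUMA node0 CPU(s): 0-47,96-143` on this machine); confirm the actual ranges with `lscpu`, and keep whatever other flags your script already passes.

```shell
# Sketch only: other tritonserver flags (ports, logging, etc.) are omitted.
# One pair of --host-policy flags per NUMA node; CPU ranges come from lscpu.
tritonserver \
  --model-repository=/models \
  --host-policy=numa0,numa-node=0 --host-policy=numa0,cpu-cores=0-47 \
  --host-policy=numa1,numa-node=1 --host-policy=numa1,cpu-cores=48-95
```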
Optimizing for latency
To optimize for latency, we recommend setting the number of models equal to the number of cores divided by 4 (num_cores/4) and setting `CAJAL_NUM_THREADS=4`. On the same AWS instance, that translates to 12 models and `CAJAL_NUM_THREADS=4`.
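For reference, here is a minimal sketch of the corresponding `instance_group` for the latency-oriented setup; the count follows the numbers above. `CAJAL_NUM_THREADS=4` itself is set at server startup via the `docker run` command, as described in the throughput section. If you are also using NUMA host policies, split the count across the per-node `instance_group` entries instead.

```
instance_group [
  {
    # Fewer instances, each running with multiple worker threads
    # (CAJAL_NUM_THREADS=4), trades aggregate throughput for lower
    # per-request latency.
    count: 12
    kind: KIND_CPU
  }
]
```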
References:
- Model configuration - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md
- Model architecture - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md
- Model management - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md