
Optimize Throughput and Latency

The NuPIC Inference Server offers flexible model performance tuning options through the model configuration file. There are typically two use cases: optimizing for throughput or optimizing for latency. Additionally, response caching can potentially improve both throughput and latency.

Optimizing for throughput

For better throughput, we want more model instances, with each instance running on fewer threads. We therefore recommend the following configurations:

  1. In the config.pbtxt for each model, set the instance_group.count property to the number of cores in the deployment server. This property controls how many model instances run concurrently.

  2. For ONNX-based models, set intra_op_thread_count to 1 for single-threaded operation on each model instance. For PyTorch-based models, set NUM_THREADS to 0. Avoid setting NUM_THREADS to 1, as this mode of single-threading introduces additional overhead.

  3. Utilize NUMA node optimization when applicable.
    On larger machines with more than one NUMA node (determined by running lscpu), further optimization can be achieved by confining each model instance to a single NUMA node. Follow these steps:

    i. Run lscpu, and note the "NUMA" section.

    ii. If there is just 1 NUMA node, then there is nothing to be done.

    iii. For each NUMA node, note the list of CPUs. Usually only the first range in the list corresponds to the physical cores; use only those. For example, if it says NUMA node0 CPU(s): 0-47,96-143, use only CPUs 0-47.

    iv. For each NUMA node, add a host policy through the --additional-args flag in nupic_inference.sh. This consists of two parameters per NUMA node, for example: --host-policy=numa0,numa-node=0 --host-policy=numa0,cpu-cores=0-47.

    v. In the config.pbtxt for the model, create a separate instance_group for each NUMA node and set its host_policy to that node's policy name. Make sure each group's count does not exceed the number of physical cores on its NUMA node.

  4. Utilize dynamic batching (see the sketch below).
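Dynamic batching is also configured in config.pbtxt. The snippet below is a minimal sketch, assuming the standard Triton-style dynamic_batching block; it takes effect only when the model's max_batch_size is greater than 0, and the queue delay shown is illustrative rather than a recommendation.

dynamic_batching {
  max_queue_delay_microseconds: 100
}

Batching several requests into one model invocation generally increases throughput at a small cost in per-request latency.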

Putting everything together, let's go over an example with 96 physical cores and 2 NUMA nodes, such as an AWS c7i.metal-48xl EC2 instance. We want to maximize overall throughput by running 96 instances, 48 on each NUMA node. Here are the relevant snippets from the configuration file.

instance_group [
  {
    count: 48
    kind: KIND_CPU
    host_policy: "numa0"
  },
  {
    count: 48
    kind: KIND_CPU
    host_policy: "numa1"
  }
]

...

# For ONNX-based models
parameters [  
  { key: "intra_op_thread_count" value: { string_value: "1" } },  
  { key: "share_session" value: { string_value: "true" } }  
]

# For PyTorch-based models
parameters: [
  {
    key: "EXECUTION_ENV_PATH"
    value: {
        string_value: "/models/envs/python-backend-deps-v0.1-rc.tar.gz"
    }
  },
  {
    key: "NUM_THREADS"
    value: {
        string_value: "0"
    }
  }
]

Optimizing for latency

Conversely, to optimize for latency, we want fewer model instances, with each instance running in a multithreaded fashion.

For best latency, use a single model instance and set intra_op_thread_count (ONNX) or NUM_THREADS (PyTorch) to "0", which instructs the runtime to use as many threads as it can. Alternatively, you can run several model instances, each with several threads, as long as the total number of threads across all instances does not exceed the number of physical cores. On servers with multiple NUMA nodes, see the NUMA configuration above; in that case, the threads of each instance_group should not exceed the number of physical cores on its NUMA node.
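As an illustration, a minimal latency-oriented setup for a PyTorch-based model might look like the sketch below; combine it with whatever other parameters your model already requires, such as the EXECUTION_ENV_PATH shown in the throughput example above.

# Single model instance using all available threads
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

# For PyTorch-based models; for ONNX-based models set intra_op_thread_count instead
parameters: [
  {
    key: "NUM_THREADS"
    value: {
        string_value: "0"
    }
  }
]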

Additionally, for GPTs, there is a CPU_LATENCY_OPTIMIZE parameter that can be set to "1" for better latencies on CPUs with AVX-512-VNNI or AMX.
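Assuming this parameter is set in config.pbtxt like the other model parameters shown above (an assumption about placement, not confirmed here), the snippet would look like:

parameters: [
  {
    key: "CPU_LATENCY_OPTIMIZE"
    value: {
        string_value: "1"
    }
  }
]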

📘

I feel the need... the need for speed! 🛩️

Beyond the configurations in config.pbtxt, enabling prompt lookup decoding in the GPT inference parameters can also help with latency. Using GPT output streaming also helps with perceived latency.

Response Caching

Another performance tuning option offered by the Inference Server is Response Caching. If your application receives many repeated requests, the response cache will improve latency for those requests.

You can enable it by adding the --enable-caching flag when launching the Inference Server. The server then maintains an in-memory cache of requests and their corresponding responses. Subsequent requests that match a cached entry are fulfilled with the stored response rather than invoking the model again, resulting in faster response times.

📘

"Caching" out performance benefits

The default cache size is 1 GB and can be changed by adding --cache-size [bytes] when launching the Inference Server.

All models except streaming models have caching enabled by default and will use the cache when it is enabled on the server. You can disable cache usage for an individual model by setting response_cache.enable to false in its config.pbtxt file.
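For example, a model that should never use the cache would include the following in its config.pbtxt:

response_cache {
  enable: false
}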

Note, however, that using the response cache incurs a slight performance overhead on all requests, so it is important to evaluate whether the benefits of caching outweigh this cost. Because the gains and losses depend on your specific setup and models, we highly recommend thoroughly testing your scenario to determine the optimal configuration.
