Optimize Throughput and Latency
The NuPIC Inference Server offers flexible model performance tuning options through the model configuration file. There are usually two use cases: optimizing for throughput or optimizing for latency. Additionally, response caching can potentially improve both throughput and latency.
Optimizing for throughput
For better throughput, we want more model instances, with each instance running on fewer threads. We therefore recommend the following configurations:
- In the config.pbtxt for each model, set the instance_group.count property to the number of cores in the deployment server. This property controls how many model instances run concurrently.
- For ONNX-based models, set intra_op_thread_count to 1 for single-threaded operation on each model instance. For PyTorch-based models, set NUM_THREADS to 0. Avoid setting NUM_THREADS to 1, as this mode of single-threading introduces additional overhead.
- Utilize NUMA node optimization when applicable. On larger machines with more than one NUMA node (determined by running lscpu), further optimization can be achieved by confining each model instance to a single NUMA node. Follow these steps:
  i. Run lscpu, and note the "NUMA" section.
  ii. If there is just 1 NUMA node, then there is nothing to be done.
  iii. For each NUMA node, note the list of CPUs. Usually only the first set of CPUs in the list corresponds to the physical cores; use only those. For example, if it says NUMA node0 CPU(s): 0-47,96-143, use only CPUs 0-47.
  iv. For each NUMA node, add a host-policy through the --additional-args flag in nupic_inference.sh. This consists of two parameters for each NUMA node: --host-policy=numa0,numa-node=0 --host-policy=numa0,cpu-cores=0-47.
  v. In the config.pbtxt for the model, create a separate instance_group for each NUMA node and set its host_policy to that node's policy name. Make sure the count does not exceed each NUMA node's number of physical cores.
- Utilize dynamic batching.
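The NUMA steps above could be scripted roughly as follows. This is a minimal sketch, not part of the NuPIC tooling: the function names are hypothetical, and it assumes (as step iii suggests is usually the case) that the first CPU range listed for each NUMA node holds the physical cores. It only builds the --host-policy flag strings; how they are passed through --additional-args is left to nupic_inference.sh.

```python
import re

# Matches lscpu lines such as "NUMA node0 CPU(s):   0-47,96-143".
# The summary line "NUMA node(s): 2" does not match because it has no node id.
NUMA_LINE = re.compile(r"^\s*NUMA node(\d+) CPU\(s\):\s*(\S+)")


def physical_core_ranges(lscpu_output):
    """Map NUMA node id -> first CPU range listed for that node.

    Assumption (step iii): the first range holds the physical cores.
    """
    ranges = {}
    for line in lscpu_output.splitlines():
        match = NUMA_LINE.match(line)
        if match:
            ranges[int(match.group(1))] = match.group(2).split(",")[0]
    return ranges


def core_count(cpu_range):
    """Number of CPUs in a range such as "0-47" (or a single id such as "3")."""
    first, _, last = cpu_range.partition("-")
    return int(last or first) - int(first) + 1


def host_policy_flags(ranges):
    """Build the two --host-policy flags per NUMA node (step iv)."""
    if len(ranges) <= 1:
        return []  # Step ii: a single NUMA node needs no host policy.
    flags = []
    for node, cpus in sorted(ranges.items()):
        flags.append(f"--host-policy=numa{node},numa-node={node}")
        flags.append(f"--host-policy=numa{node},cpu-cores={cpus}")
    return flags


if __name__ == "__main__":
    # Hypothetical lscpu excerpt for a 2-node machine.
    sample = (
        "NUMA node(s):        2\n"
        "NUMA node0 CPU(s):   0-47,96-143\n"
        "NUMA node1 CPU(s):   48-95,144-191\n"
    )
    ranges = physical_core_ranges(sample)
    print(" ".join(host_policy_flags(ranges)))
    for node, cpus in sorted(ranges.items()):
        print(f"numa{node}: instance_group count <= {core_count(cpus)}")
```

On a real machine, the text passed to physical_core_ranges would be the captured output of lscpu. The core counts printed at the end give the upper bound for each instance_group count in step v.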
Putting everything together, let's go over an example of 96 cores with 2 NUMA nodes, such as an AWS c7i.metal-48xl EC2 instance. We want to maximize the overall throughput by running 96 instances, with 48 on each NUMA node. Here are the relevant snippets in the configuration file:
instance_group [
  {
    count: 48
    kind: KIND_CPU
    host_policy: "numa0"
  },
  {
    count: 48
    kind: KIND_CPU
    host_policy: "numa1"
  }
]
..
# For ONNX-based models
parameters [
  { key: "intra_op_thread_count" value: { string_value: "1" } },
  { key: "share_session" value: { string_value: "true" } }
]
# For PyTorch-based models
parameters: [
  {
    key: "EXECUTION_ENV_PATH"
    value: {
      string_value: "/models/envs/python-backend-deps-v0.1-rc.tar.gz"
    }
  },
  {
    key: "NUM_THREADS"
    value: {
      string_value: "0"
    }
  }
]
Optimizing for latency
Conversely, to optimize for latency, we need to have fewer model instances, with each model instance running in a multithreaded fashion.
For best latency, use a single model instance and set intra_op_thread_count or NUM_THREADS for ONNX and PyTorch models, respectively, to "0", which instructs the runtime to use as many threads as it can. Alternatively, you can specify several model instances, each with several threads, such that the total number of threads across all instances does not exceed the number of physical cores. On servers with multiple NUMA nodes, see above for NUMA configuration; for each instance_group, the total number of threads across its instances should not exceed the number of cores on a single NUMA node.
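The thread budget for the "several instances, several threads each" option is simple integer arithmetic. The sketch below is illustrative only: threads_per_instance is a hypothetical helper, and the 48-core NUMA node is an assumed example. The result is the largest per-instance thread count that keeps instances times threads within the available physical cores.

```python
def threads_per_instance(physical_cores, instances):
    """Largest thread count such that instances * threads <= physical_cores."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if instances > physical_cores:
        raise ValueError("more instances than physical cores")
    return physical_cores // instances


# Hypothetical server: each NUMA node has 48 physical cores.
cores_per_numa_node = 48
for instances in (2, 4, 8):
    threads = threads_per_instance(cores_per_numa_node, instances)
    print(f"{instances} instances per node -> up to {threads} threads each")
```

The computed value would be the candidate setting for intra_op_thread_count or NUM_THREADS when each instance_group is confined to one NUMA node. For a single instance, the "0" setting described above already lets the runtime choose.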
Additionally, for GPTs, there is a CPU_LATENCY_OPTIMIZE parameter that can be set to "1" for better latencies on CPUs with AVX-512-VNNI or AMX.
I feel the need... the need for speed! 🛩️
Beyond the configurations in config.pbtxt, enabling prompt lookup decoding in the GPT inference parameters can also help with latency. Using GPT output streaming also helps with perceived latency.
Response Caching
Another performance tuning option offered by the Inference Server is Response Caching. If our application receives many repeated requests, the response cache can improve latency for those requests.
We can enable it by adding the --enable-caching flag when launching the Inference Server. By doing so, the server maintains a cache in memory, storing both the requests and corresponding responses. Subsequent requests matching those in the cache are fulfilled using the stored response rather than invoking the model again, resulting in faster response times.
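To make the mechanism concrete, here is a toy illustration of request-keyed response caching. It is a conceptual sketch only and not the Inference Server's implementation: the ResponseCache class, the JSON-plus-SHA-256 key scheme, and fake_model are all assumptions made for this example.

```python
import hashlib
import json


class ResponseCache:
    """Toy in-memory response cache keyed by a hash of the request."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model_name, inputs):
        # Serialize deterministically so equal requests produce the same key.
        payload = json.dumps({"model": model_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def infer(self, model_name, inputs, run_model):
        key = self._key(model_name, inputs)
        if key in self._store:
            # Cache hit: return the stored response without running the model.
            self.hits += 1
            return self._store[key]
        # Cache miss: run the model, then remember the response.
        self.misses += 1
        response = run_model(inputs)
        self._store[key] = response
        return response


def fake_model(inputs):
    # Hypothetical stand-in for real model inference.
    return {"length": len(inputs["text"])}


cache = ResponseCache()
first = cache.infer("demo-model", {"text": "hello"}, fake_model)
second = cache.infer("demo-model", {"text": "hello"}, fake_model)
print(first == second, cache.hits, cache.misses)
```

The second call with an identical request is served from the stored response, which is where the latency benefit for repeated requests comes from. Note that even a miss pays for computing the key and the lookup.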
"Caching" out performance benefits
The default cache size is 1GB, and can be set by adding --cache-size [bytes] when launching the Inference Server.
All models have caching enabled by default (except streaming models), and will use the cache when it is enabled in the server, but you can manually disable cache usage for an individual model by setting response_cache.enable to false in the respective config.pbtxt file.
However, it's important to note that using the response cache incurs a slight performance overhead on all requests. Therefore, it's crucial to evaluate whether the benefits of enabling the cache outweigh this overall performance impact. Since the performance gains and losses are contingent upon individual setups and models, we highly recommend thoroughly testing your specific scenario to determine the optimal configuration.
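One rough way to reason about this trade-off before measuring is a simple average-latency model. The sketch below is a simplification under stated assumptions, not a description of the server's behavior: it assumes every request pays a fixed cache overhead, only cache misses run the model, and all numbers are hypothetical. It does not replace testing on your own setup.

```python
def mean_latency_with_cache(hit_rate, model_ms, cache_overhead_ms):
    """Mean latency when all requests pay the overhead and only misses run the model."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return cache_overhead_ms + (1.0 - hit_rate) * model_ms


def break_even_hit_rate(model_ms, cache_overhead_ms):
    """Hit rate above which the cache lowers mean latency in this model.

    Derived from: overhead + (1 - h) * model < model  <=>  h > overhead / model.
    """
    return cache_overhead_ms / model_ms


# Hypothetical numbers: 20 ms model inference, 0.5 ms cache overhead per request.
model_ms, overhead_ms = 20.0, 0.5
print("break-even hit rate:", break_even_hit_rate(model_ms, overhead_ms))
print("mean latency at 25% hits:", mean_latency_with_cache(0.25, model_ms, overhead_ms))
```

In this model, a fast model with rarely repeated requests may see no benefit, while a slow model with frequent repeats benefits substantially. Measured hit rates and overheads from your own deployment should drive the final decision.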
References:
- Model configuration - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md
- Model architecture - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md
- Model management - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md
- Response caching - https://github.com/triton-inference-server/server/blob/main/docs/user_guide/response_cache.md