Configure a Model
To run a model on the NuPIC Inference Server, you must provide a configuration file for the model. This config.pbtxt file controls how the model is loaded and executed at inference time. The configuration file is typically located within each model subdirectory. For example:
nupic/inference/models/nupic-sbert.base-v3
├── 1
│   └── model.onnx
└── config.pbtxt
Detailed information about the model configuration file can be found on the Triton Server's Model Configuration page.
The following configuration properties are particularly relevant for NuPIC Inference Server users.
Model Name
The model name can be specified using the name property. For example, the following snippet indicates that the configuration file is for a model named nupic-sbert.base-v3.
name: "nupic-sbert.base-v3"
The model name is crucial because the model artifacts, including the configuration file, should be placed in a directory with the same name within the model repository. Using the same example, the model artifacts should be stored under nupic/inference/models/nupic-sbert.base-v3/.
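Because a mismatch between the name property and the directory name is easy to introduce when copying a model, a small pre-deployment check can help. The following is a minimal sketch, not part of NuPIC: the helper names read_model_name and name_matches_directory are hypothetical, and the regular expression assumes the top-level name line is the first unindented name: line in the file.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def read_model_name(config_path: Path) -> Optional[str]:
    """Return the first unindented `name` value in a config.pbtxt, if any."""
    text = config_path.read_text(encoding="utf-8")
    # Assumption: the top-level `name:` line starts at column 0 and precedes
    # any other unindented `name:` line (such as tensor names).
    match = re.search(r'^name\s*:\s*"([^"]*)"', text, flags=re.MULTILINE)
    return match.group(1) if match else None


def name_matches_directory(model_dir: Path) -> bool:
    """Check that the `name` property equals the model directory name."""
    return read_model_name(model_dir / "config.pbtxt") == model_dir.name


# Demo on a throwaway model repository.
with tempfile.TemporaryDirectory() as repo:
    model_dir = Path(repo) / "nupic-sbert.base-v3"
    model_dir.mkdir()
    (model_dir / "config.pbtxt").write_text(
        'name: "nupic-sbert.base-v3"\n', encoding="utf-8"
    )
    print(name_matches_directory(model_dir))
```

In practice, the same check could be run over every subdirectory of the model repository before starting the server.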
Model platform
The platform property in the model configuration file specifies the Triton server backend used to execute the model. For example, the value onnxruntime_onnx selects the ONNX Runtime backend, and the value python selects the Python backend. Refer to the Backend-Platform Support Matrix for all backends that Triton server currently supports.
Instance group
The instance_group property contains several settings for model instances. Specifically, the count value controls how many instances of the same model run concurrently, and the kind value controls the type of hardware the model instances run on. The following snippet spins up 96 concurrent model instances, all running on CPUs.
instance_group [
  {
    count: 96
    kind: KIND_CPU
  }
]
There are many compelling reasons to run models on CPU, but you can still choose to run a model on GPU by specifying kind: KIND_GPU.
Inputs and outputs
A model's configuration file must contain input and output properties. They specify the input and output tensors of the model, respectively, as well as their dimensions. For example, the following configuration snippet specifies a text input and a text output. A value of -1 denotes a variable-sized dimension.
input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [-1]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_STRING
    dims: [-1]
  }
]
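To show how the tensor names and dimensions in this section relate to an actual request, here is a minimal sketch of a request body in the KServe v2 inference protocol, which Triton's HTTP endpoint implements. It assumes the NuPIC Inference Server exposes that endpoint unchanged and that no extra batch dimension is configured for the model; build_infer_request is a hypothetical helper, and the sketch only builds the JSON without sending it.

```python
import json
from typing import Any, Dict, List


def build_infer_request(texts: List[str]) -> Dict[str, Any]:
    """Build a KServe v2 style request body for a TEXT input and OUTPUT output."""
    return {
        "inputs": [
            {
                "name": "TEXT",         # must match input.name in config.pbtxt
                "shape": [len(texts)],  # concrete size for the variable (-1) dim
                "datatype": "BYTES",    # wire datatype used for TYPE_STRING
                "data": texts,
            }
        ],
        "outputs": [{"name": "OUTPUT"}],  # must match output.name
    }


body = build_infer_request(["hello world", "a second sentence"])
print(json.dumps(body, indent=2))
```

The variable-sized dimension is what allows the same model to accept one string or many: the request states the concrete length in its shape field.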
Parameters
The parameters property in the model configuration file is a dictionary that contains key-value pairs for additional configuration.
For models running with the ONNX backend, that is, models whose platform property value is onnxruntime_onnx:
intra_op_thread_count controls the number of threads in the ONNX Runtime intra-op thread pool. This influences the extent of parallelism.
share_session controls whether all instances of the same model share the same ONNX Runtime session.
The following configuration snippet gives the ONNX Runtime intra-op thread pool 4 threads and shares the ONNX Runtime session object among all instances of the same model.
parameters [
{ key: "intra_op_thread_count" value: { string_value: "4" } },
{ key: "share_session" value: { string_value: "true" } }
]
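Note that every parameter value in the entries above is written as a string, including numbers and booleans. When configuration files are generated by a script, that conversion is easy to get wrong, so the sketch below renders a parameters block from a Python dictionary. The helper render_parameters is hypothetical and simply reproduces the layout of the ONNX example.

```python
from typing import Dict, Union

ParamValue = Union[str, int, bool]


def render_parameters(params: Dict[str, ParamValue]) -> str:
    """Render a dict as a pbtxt `parameters` block with string values."""
    entries = []
    for key, value in params.items():
        # Booleans become lowercase "true"/"false"; everything else uses str().
        text = str(value).lower() if isinstance(value, bool) else str(value)
        entries.append(
            f'  {{ key: "{key}" value: {{ string_value: "{text}" }} }}'
        )
    return "parameters [\n" + ",\n".join(entries) + "\n]"


print(render_parameters({"intra_op_thread_count": 4, "share_session": True}))
```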
For models running with the Python backend, that is, models whose platform property value is python:
EXECUTION_ENV_PATH points to a packed conda environment for the model to run in. It should contain all the runtime dependencies for the model.
NUM_THREADS controls the number of threads for intra-op parallelism on CPU. Setting the value to 0, or omitting this parameter, means the model will use as many threads as are available or needed.
The following configuration snippet specifies the location of the packed conda environment and allocates 4 threads for intra-op parallelism.
parameters: [
  {
    key: "EXECUTION_ENV_PATH"
    value: {
      string_value: "/models/envs/python-backend-deps-v0.tar.gz"
    }
  },
  {
    key: "NUM_THREADS"
    value: {
      string_value: "4"
    }
  }
]
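CPU_LATENCY_OPTIMIZE. When set to "1", this can potentially provide lower latency on CPUs with support for AVX-512-VNNI and AMX. The following entry shows the parameter with the value "0"; replace it with "1" to apply the optimization.
{
  key: "CPU_LATENCY_OPTIMIZE"
  value: {
    string_value: "0"
  }
}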
Managing Multiple Streams
The NuPIC Inference Server automatically handles multiple concurrent requests. If requests arrive faster than the server can process them, the excess requests are queued. To further improve performance, take the number of available CPU cores into account when configuring the model.
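One rough way to take the core count into account, offered here as an assumption rather than NuPIC guidance, is to compare the total number of threads a configuration may request (instance count multiplied by threads per instance) with the cores available on the host. The helpers total_threads and fits_on_cores are hypothetical, and the sketch ignores effects such as session sharing between instances.

```python
import os


def total_threads(instance_count: int, threads_per_instance: int) -> int:
    """Upper bound on worker threads if every instance uses its full pool."""
    return instance_count * threads_per_instance


def fits_on_cores(instance_count: int, threads_per_instance: int, cores: int) -> bool:
    """True when the thread upper bound does not exceed the core count."""
    return total_threads(instance_count, threads_per_instance) <= cores


# Illustrative values: the instance count and thread count used as examples
# in this document, checked against the cores visible to this process.
cores = os.cpu_count() or 1
requested = total_threads(96, 4)
print(f"{requested} threads requested, {cores} cores available")
print("oversubscribed" if not fits_on_cores(96, 4, cores) else "fits")
```

If the check reports oversubscription, lowering either count or the per-instance thread count is the obvious lever; which one gives better throughput or latency depends on the workload and is best measured.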