Configure a Model

To run a model on the NuPIC Inference Server, you must provide a configuration file for the model. This config.pbtxt file controls how the model is loaded and executed at inference time. The configuration file is typically located within each model's subdirectory. For example:

nupic/inference/models/nupic-sbert.base-v3
├── 1
│   └── model.onnx
└── config.pbtxt

Detailed information about the model configuration file can be found on the Triton Server’s Model Configuration page.

The following are several notable configuration properties for users of the NuPIC Inference Server.

Model Name

The model name can be specified using the name property. For example, the following snippet indicates that the configuration file is for a model named nupic-sbert.base-v3:

name: "nupic-sbert.base-v3"

The model name is crucial because the model artifacts, including the configuration file, must be placed in a directory with the same name within the model repository. Using the same example, the model artifacts should be stored under nupic/inference/models/nupic-sbert.base-v3/.

Model platform

The platform property in the model configuration file specifies the Triton server backend used to execute the model. For example, the value onnxruntime_onnx runs the model with the ONNX Runtime backend, and the value python runs it with the Python backend. Refer to the Backend-Platform Support Matrix for all backends that the Triton server currently supports.
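
For example, the following line in config.pbtxt selects the ONNX Runtime backend:

platform: "onnxruntime_onnx"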

Instance group

The instance_group property contains several configurations for model instances. Specifically, the count value controls how many instances of the same model will run concurrently, and the kind value controls the type of hardware the model instances will run on. The following snippet will spin up 96 concurrent model instances, all running on CPUs.

instance_group [  
  {  
    count: 96  
    kind: KIND_CPU  
  }  
]

There are many compelling reasons to run models on CPU, but you can still choose to run a model on GPU by specifying kind: KIND_GPU.
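
As a sketch, an instance_group for GPU execution might look like the following (the count of 2 and the optional gpus field, which pins instances to specific GPU devices, are illustrative rather than recommendations):

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]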

Inputs and outputs

A model's configuration file must contain input and output properties. They specify the model's input and output tensors, respectively, as well as their dimensions. For example, the following configuration snippet specifies a text input and a text output; -1 denotes a variable-size dimension.

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING  
    dims: [-1]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_STRING  
    dims: [-1]
  }
]

Parameters

The parameters property in the model configuration file is a dictionary that contains key-value pairs for additional configuration.

For models running with the ONNX backend, that is, models whose platform property value is onnxruntime_onnx:

  • intra_op_thread_count controls the number of threads in the ONNX Runtime intra-op thread pool. This influences the extent of parallelism.
  • share_session controls whether all instances of the same model share the same ONNX Runtime session.

The following configuration snippet allocates 4 threads to the ONNX Runtime intra-op thread pool and shares the ONNX Runtime session object among all instances of the same model.

parameters [  
  { key: "intra_op_thread_count" value: { string_value: "4" } },  
  { key: "share_session" value: { string_value: "true" } }  
]

For models running with the Python backend, that is, models whose platform property value is python:

  • EXECUTION_ENV_PATH points to a packed conda environment in which the model runs. It should contain all the runtime dependencies of the model.
  • NUM_THREADS controls the number of threads for intra-op parallelism on CPU. Setting the value to 0, or omitting this parameter, means the model will use as many threads as are available or needed.

The following configuration snippet allocates 4 threads for intra-op parallelism and specifies the location of the packed conda environment.

parameters: [  
  {  
    key: "EXECUTION_ENV_PATH"  
    value: {  
        string_value: "/models/envs/python-backend-deps-v0.tar.gz"  
    }  
  },  
  {  
    key: "NUM_THREADS"  
    value: {  
        string_value: "4"  
    }  
  }  
]
  • CPU_LATENCY_OPTIMIZE, when set to "1", can potentially provide better latencies on CPUs with support for AVX-512 VNNI and AMX. The following parameters entry sets it to "0"; change string_value to "1" to turn the optimization on:

    {
      key: "CPU_LATENCY_OPTIMIZE"
      value: {
          string_value: "0"
      }
    }

Managing Multiple Streams

The NuPIC Inference Server automatically handles multiple concurrent requests. If requests arrive faster than they can be processed, they are queued. To further improve performance, take the number of available CPU cores into account when configuring the model, for example when choosing the instance_group count.
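
The following is a minimal client-side sketch of sending concurrent requests. It assumes the server exposes the standard Triton HTTP endpoint on localhost:8000 and uses the generic tritonclient package; the model name and the TEXT/OUTPUT tensor names are taken from the example configuration above, so adjust them for your deployment.

# Hypothetical client sketch: several requests sent in parallel so the server
# can queue and schedule them across the configured model instances.
# Assumes: pip install "tritonclient[http]" numpy, and a text-in/text-out
# model named "nupic-sbert.base-v3" with the TEXT/OUTPUT tensors shown above.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "nupic-sbert.base-v3"  # assumption: matches name in config.pbtxt
URL = "localhost:8000"              # assumption: default Triton HTTP port


def infer(text: str) -> np.ndarray:
    client = httpclient.InferenceServerClient(url=URL)
    # TYPE_STRING tensors are sent as BYTES from the client side.
    inp = httpclient.InferInput("TEXT", [1], "BYTES")
    inp.set_data_from_numpy(np.array([text], dtype=object))
    result = client.infer(model_name=MODEL_NAME, inputs=[inp])
    return result.as_numpy("OUTPUT")


if __name__ == "__main__":
    texts = [f"sample sentence {i}" for i in range(16)]
    # Requests beyond the server's current capacity are queued, not dropped.
    with ThreadPoolExecutor(max_workers=8) as pool:
        outputs = list(pool.map(infer, texts))
    print(f"received {len(outputs)} responses")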