
Run Inference on a Model

Running large language model (LLM) inference on the NuPIC Inference Server is simple to develop and easy to manage. The Inference Server provides a Python inference client for application development, making it straightforward to integrate into a variety of software environments and workflows. You can also interact with the Inference Server through our REST API, but this guide focuses on the Python client. For simplicity, we assume the Python client and the Inference Server are running on the same machine.

Prerequisites

First, make sure the Inference Server is installed and running. Next, check that the Python client dependencies are installed and that your virtual environment is activated.
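If you want a quick sanity check that the server is reachable before going further, a small socket test like the sketch below works; it assumes the default address of localhost:8000 used throughout this guide.

import socket

# Quick reachability check; assumes the Inference Server listens on localhost:8000
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(2)
    if sock.connect_ex(("localhost", 8000)) == 0:
        print("Inference Server is reachable.")
    else:
        print("Could not reach the Inference Server -- is it running?")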

Importing the ClientFactory

Start by importing the ClientFactory class into your Python script or notebook:

from nupic.client.inference_client import ClientFactory

Defining Constants

Next, define some constants that we will be using later.

URL = "localhost:8000" # Change this if the Inference Server is running remotely
PROTOCOL = "http"
CERTIFICATES = {}
CONCURRENCY = 1

MODEL_NAME = "nupic-sbert.base-v3" # Select a model from the Model Library
INPUTS = [
    "Numenta is the global leader in deploying large AI models on CPU.",
    "Neuroscience-based AIs are efficient.",
    "I love CPUs!",
]

📘

More connections

The CONCURRENCY variable specifies the number of connections that the Python client establishes with the Inference Server at any one time. Higher concurrency values can result in greater throughput. Acceptable concurrency values range from 1 to 4 (inclusive). This is not to be confused with model concurrency.
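If you prefer to set the concurrency from configuration rather than hard-coding it, a small guard like the sketch below keeps the value in the accepted range; the environment variable name here is just an illustration.

import os

# Hypothetical environment variable; defaults to 1 if unset
CONCURRENCY = int(os.environ.get("NUPIC_CLIENT_CONCURRENCY", "1"))
if not 1 <= CONCURRENCY <= 4:
    raise ValueError(f"CONCURRENCY must be between 1 and 4, got {CONCURRENCY}")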

Instantiating the Python Client

Now create an instance of the Python client by passing in the constants defined above. Running this for the first time takes a little longer because the model is loaded into memory on the Inference Server.

nupic_client = ClientFactory.get_client(
    MODEL_NAME, URL, PROTOCOL, CERTIFICATES, concurrency=CONCURRENCY
)

Running Inference

Finally, we can pass our inputs to the model for inference.

encodings = nupic_client.infer(INPUTS)["encodings"]
print(encodings)

💡 The .infer() method works for NuPIC models that come bundled with a tokenizer. If you want to use your own tokenizer, use the .infer_from_tokens() method instead, in conjunction with a NuPIC model that does not include a tokenizer. Please see nupic.examples/examples/user_side_tokenization/ for details.
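As a rough sketch of the user-side tokenization path, the example below tokenizes the inputs with a Hugging Face tokenizer and passes the result to .infer_from_tokens(). The tokenizer name and the exact arguments .infer_from_tokens() accepts are assumptions here, and the client would need to be created with a tokenizer-free model; see nupic.examples/examples/user_side_tokenization/ for the authoritative interface.

from transformers import AutoTokenizer

# Assumption: a BERT-style tokenizer compatible with the model; pick the right one for your model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize on the client side, returning NumPy arrays
tokens = tokenizer(INPUTS, padding=True, truncation=True, return_tensors="np")

# Assumption: .infer_from_tokens() accepts token IDs and attention masks;
# check the NuPIC example for the exact signature
token_encodings = nupic_client.infer_from_tokens(
    tokens["input_ids"], tokens["attention_mask"]
)["encodings"]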

The encodings returned by .infer() should look similar to this:

[[ 0.17598186  0.43633145  0.12241551 ... -0.26492846 -0.57648134
  -0.23418485]
 [-1.0308936   1.6698977   0.26773554 ... -0.3904295  -0.37393257
   0.48629946]
 [ 0.18071954  0.35302585  0.00806469 ... -0.26461816 -1.3490938
  -0.9053489 ]]

Examining the Output

Let's dig a little deeper into the encodings.

print(type(encodings))
print(encodings.shape)

Output:

<class 'numpy.ndarray'>
(3, 768)

We can see the encodings are returned as a numpy array. The 0th dimension of the array corresponds to the three input strings we originally passed to the model. For BERT models, the encoding of each string is a vector of length 768.

These vectors can subsequently be fed into a simple downstream model (e.g., logistic regression, clustering) for your use case.
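For instance, a quick way to see the encodings in action is to compare the three inputs by cosine similarity; the snippet below uses only numpy on the encodings array produced above.

import numpy as np

# Normalize each encoding to unit length, then compute pairwise cosine similarities
normalized = encodings / np.linalg.norm(encodings, axis=1, keepdims=True)
similarity = normalized @ normalized.T  # shape (3, 3); diagonal is 1.0
print(np.round(similarity, 3))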

Remember, the inference ran entirely on CPU. Numenta models have been specially optimized to run efficiently on CPUs. How much more efficient? Check out our benchmark!