Release Notes

Version 2.2.0

New Features

NuPIC-GPT Performance

We made nupic-gpt.7b-corti run even faster, right out of the box. You can try it with our existing examples.

Recommendation System

We now include a tutorial on using NuPIC-BERT models as part of a recommendation system.

Version 2.1.0

New Features

Gemma 2 Added to Model Library

The NuPIC Model Library now includes the latest open source model from Google: gemma2.it.9b!

Prompt Lookup Decoding for Streaming Models

Prompt lookup decoding now works with output streaming, which can further reduce latency for a better user experience.

Easier NUMA Policies

Now you can apply NUMA policies straight from the Inference Server CLI, without having to dive into code.

GPT Fine-Tuning

You can now tune GPT models with your own data. Check out our tutorial!

Version 2.0.1

New Features

NuPIC-GPT Optimizations

We optimized the memory usage of nupic-gpt.7b-corti.v0. The model now requires ~12GB to load and run.

Prompt Lookup Decoding

Prompt lookup decoding is now available for GPT models in NuPIC. This speeds up inference by producing candidate output tokens from the prompt itself! This currently works with non-streaming outputs.
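To illustrate the idea behind the technique (this is a conceptual sketch, not NuPIC's implementation), prompt lookup decoding searches the prompt for an n-gram that matches the most recently generated tokens and proposes the tokens that followed it as draft candidates, which the model then verifies in a single pass.

```python
# Conceptual sketch of prompt lookup decoding (not NuPIC's implementation).
# Candidate tokens are proposed by matching the last few generated tokens
# against the prompt and copying whatever followed the match.

def propose_candidates(prompt_tokens, generated_tokens, ngram_size=3, num_candidates=5):
    """Return up to num_candidates draft tokens copied from the prompt."""
    if len(generated_tokens) < ngram_size:
        return []
    tail = generated_tokens[-ngram_size:]
    # Scan the prompt for an occurrence of the tail n-gram.
    for start in range(len(prompt_tokens) - ngram_size + 1):
        if prompt_tokens[start:start + ngram_size] == tail:
            return prompt_tokens[start + ngram_size:start + ngram_size + num_candidates]
    return []

# Example with token IDs standing in for a tokenized prompt and partial output.
prompt = [5, 8, 2, 9, 4, 7, 2, 9, 4, 1]
generated = [3, 2, 9, 4]
print(propose_candidates(prompt, generated))  # -> [7, 2, 9, 4, 1]
```

The model only keeps the drafted tokens it agrees with, so correctness is unchanged while accepted drafts skip expensive generation steps.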

More Optimizations for CPU

GPT models now achieve even lower latencies on CPUs that support AVX-512 VNNI or AMX.
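To check whether your CPU exposes these instruction sets on Linux, you can inspect /proc/cpuinfo. The flag names below (avx512_vnni, amx_tile, amx_int8, amx_bf16) follow common kernel conventions but may vary by kernel version.

```python
# Quick Linux check for the CPU features mentioned above.
# Flag names follow common /proc/cpuinfo conventions and may vary by kernel.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("AVX-512 VNNI:", "avx512_vnni" in flags)
print("AMX:", bool({"amx_tile", "amx_int8", "amx_bf16"} & flags))
```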

Version 2.0.0 🎉

New Features

New NuPIC-GPT Model

We're proud to introduce nupic-gpt.7b-dendi.v0. This is our second NuPIC-GPT model, now with a fresh set of neuroscience-based optimizations!

Llama 3 Added to Model Library

Llama 3 brings improved performance and a larger vocabulary compared to Llama 2.

Response Caching

Input-output pairs can now be cached in memory for each model, allowing for better performance on frequently repeated queries.
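As a rough illustration of the idea (the server's cache is internal and not configured in code like this), an in-memory response cache is a lookup keyed on the input, so repeated queries skip inference entirely. The function names below are hypothetical placeholders.

```python
# Conceptual sketch of per-model response caching (not the server's internal code).
from functools import lru_cache

def run_inference(model_name: str, prompt: str) -> str:
    # Hypothetical placeholder standing in for a real call to the Inference Server.
    return f"[{model_name}] response to: {prompt}"

@lru_cache(maxsize=1024)
def cached_generate(model_name: str, prompt: str) -> str:
    # Identical (model, prompt) pairs are served from memory after the first call.
    return run_inference(model_name, prompt)

print(cached_generate("nupic-gpt.7b-dendi.v0", "What is NuPIC?"))
print(cached_generate("nupic-gpt.7b-dendi.v0", "What is NuPIC?"))  # served from cache
```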

Version 1.3.0

New Features

New Model Nomenclature

We've renamed the models in the Model Library for better clarity. Please see the Model Library page for the updated names and what they mean.

GPT Streaming

We added output streaming support for GPT models. Model outputs can now be printed as they are generated for a better user experience.
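As a sketch only (the actual client method names may differ), consuming a streamed output means iterating over partial results and printing them as they arrive. The stream_generate generator here is a hypothetical stand-in for the client's streaming call.

```python
# Minimal sketch of consuming streamed output; `stream_generate` is a
# hypothetical generator standing in for the client's streaming call.
import time

def stream_generate(prompt):
    # Stand-in: yields tokens one at a time, as a real streaming call would.
    for token in ["NuPIC ", "models ", "stream ", "tokens ", "as ", "they ", "arrive."]:
        time.sleep(0.1)
        yield token

for chunk in stream_generate("Tell me about NuPIC streaming"):
    print(chunk, end="", flush=True)
print()
```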

Performance Dashboard

The NuPIC Performance Dashboard now launches automatically together with the Inference Server. There's no longer a need to launch it separately. The documentation page for this has been updated accordingly. This requires the Docker Compose plugin, which has been added to System Requirements.

Faster and More Secure Connections

You can now connect to the Inference Server using gRPC and HTTPS over SSL. Please see the instructions here.
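The linked instructions cover the NuPIC client setup. As a generic illustration only, a TLS-secured gRPC channel in Python is built from the server's certificate; the host, port, and certificate path below are placeholders, not NuPIC defaults.

```python
# Generic example of opening a TLS-secured gRPC channel with the grpcio package.
# Host, port, and certificate path are placeholders, not NuPIC defaults.
import grpc

with open("server_cert.pem", "rb") as f:
    credentials = grpc.ssl_channel_credentials(root_certificates=f.read())

channel = grpc.secure_channel("inference.example.com:8443", credentials)
# Pass `channel` to the client stub used for inference requests.
```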

Known Issues

  1. The nupic-gpt.7b-corti.v0 model requires ~64GB of memory during loading. After loading is complete, the memory footprint sits at ~12GB.

Version 1.2.0

New Features

NuPIC-GPT

We've added nupic-gpt.zs-7b, a CPU-optimized GPT model. You can test it out using the GPT Chat and other GPT-related examples.

Bring Your Own Model (BYOM)

The Training Module can now import a model and package it for use in the NuPIC Inference Server. There is an example in examples/byom that imports bert-large-cased. The resulting model can then be copied to the inference/models directory for use in the Inference Server (see the sketch after the model list below).

Currently, BYOM works with the following HF models:

  • bert-large-cased
  • sentence-transformers/all-MiniLM-L6-v2
  • sentence-transformers/multi-qa-mpnet-base-dot-v1
  • sentence-transformers/all-mpnet-base-v2
  • sentence-transformers/all-roberta-large-v1
  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
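Once the BYOM example has packaged an imported model, copying it into place can be as simple as the following. The output path is an assumption for illustration; only the inference/models destination comes from the description above.

```python
# Illustrative copy step: move a packaged BYOM model into the Inference
# Server's model directory. The packaged-model path is a hypothetical example.
import shutil
from pathlib import Path

packaged_model = Path("examples/byom/output/bert-large-cased")  # assumed output location
models_dir = Path("inference/models")

shutil.copytree(packaged_model, models_dir / packaged_model.name, dirs_exist_ok=True)
print(f"Copied {packaged_model.name} into {models_dir}")
```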

Known Issues

  1. GPT on CPU on a GPU-enabled machine. The Zephyr, Llama 2, and Gemma models can be configured (via config.pbtxt) to run on a CPU or GPU. However, the models fail if configured for CPU on a GPU-enabled machine.
  2. The NuPIC client library supports client-side tokenization for BERT models, but this does not work for BYOM-imported models. To perform inference with a BYOM model, specify the model name with the -wtokenizer extension.
  3. The nupic-gpt.zs-7b model uses approximately 11GB when fully loaded, but loading requires approximately 64GB of RAM at its peak. With less than 64GB available, the model will fail to load and the client will incorrectly report that the model name is invalid. A pre-flight memory check is sketched below.
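To avoid the misleading error in item 3, you can check available memory before asking the server to load the model. This sketch uses the third-party psutil package and the ~64GB peak figure quoted above.

```python
# Pre-flight check for the ~64GB peak needed while loading nupic-gpt.zs-7b.
# Requires the third-party `psutil` package (pip install psutil).
import psutil

PEAK_LOAD_GIB = 64  # approximate peak memory during model loading (see item 3)

available_gib = psutil.virtual_memory().available / (1024 ** 3)
if available_gib < PEAK_LOAD_GIB:
    print(f"Only {available_gib:.1f} GiB available; loading may fail. "
          f"Free up memory or use a machine with at least {PEAK_LOAD_GIB} GiB.")
else:
    print(f"{available_gib:.1f} GiB available; enough headroom to load the model.")
```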