Release Notes
Version 2.2.0
New Features
NuPIC-GPT Performance
We made nupic-gpt.7b-corti run even faster, right out of the box. You can try it with our existing examples.
Recommendation System
We now include a tutorial on using NuPIC-BERT models as part of a recommendation system.
Version 2.1.0
New Features
Gemma 2 Added to Model Library
The NuPIC Model Library now includes the latest open source model from Google: gemma2.it.9b!
Prompt Lookup Decoding for Streaming Models
Prompt lookup decoding now works with output streaming, which can reduce latency for a better user experience.
Easier NUMA Policies
Now you can apply NUMA policies straight from the Inference Server CLI, without having to dive into code.
GPT Fine-Tuning
You can now fine-tune GPT models with your own data. Check out our tutorial!
Version 2.0.1
New Features
NuPIC-GPT Optimizations
We optimized the memory usage of nupic-gpt.7b-corti.v0. The model now requires ~12GB to load and run.
Prompt Lookup Decoding
Prompt lookup decoding is now available for GPT models in NuPIC. This speeds up inference by drawing candidate output tokens from the prompt itself! It currently works only with non-streaming outputs.
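To illustrate the general idea behind prompt lookup decoding, here is a minimal sketch of the candidate-proposal step. It is not NuPIC's implementation: the function name lookup_candidates and the parameters ngram_size and max_candidates are hypothetical, and token IDs are plain integers. In the general technique, the model then checks the proposed tokens and keeps only those it agrees with.

```python
from typing import List, Sequence


def lookup_candidates(tokens: Sequence[int], ngram_size: int = 3,
                      max_candidates: int = 5) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in `tokens`.

    `tokens` is the prompt plus anything generated so far. All names here are
    hypothetical and only illustrate the technique.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan from the most recent position to the oldest, skipping the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            # The tokens that followed the earlier match become the candidates.
            return list(tokens[follow:follow + max_candidates])
    return []  # No earlier match: fall back to ordinary decoding.
```

For example, lookup_candidates([1, 2, 3, 4, 5, 6, 1, 2, 3], ngram_size=3, max_candidates=3) returns [4, 5, 6], because the trailing n-gram 1, 2, 3 was followed by those tokens earlier in the sequence.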
More Optimizations for CPU
GPT models now run with even lower latency on CPUs that support AVX-512-VNNI or AMX.
Version 2.0.0 🎉
New Features
New NuPIC-GPT Model
We're proud to introduce nupic-gpt.7b-dendi.v0. This is our second NuPIC-GPT model, now with a fresh set of neuroscience-based optimizations!
Llama 3 Added to Model Library
Llama 3 brings improved performance and a larger vocabulary compared to Llama 2.
Response Caching
Input-output pairs can now be cached in memory for each model, allowing for better performance on frequently repeated queries.
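The sketch below shows one way a per-model, in-memory cache of input-output pairs could behave. It is an illustration only, not the Inference Server's code: the ResponseCache class, its max_entries limit, and the least-recently-used eviction policy are all assumptions.

```python
from collections import OrderedDict
from typing import Callable, Dict


class ResponseCache:
    """Hypothetical in-memory cache that keeps one store per model name."""

    def __init__(self, max_entries: int = 1024) -> None:
        self.max_entries = max_entries  # assumed per-model limit
        self._stores: Dict[str, "OrderedDict[str, str]"] = {}

    def get_or_compute(self, model_name: str, prompt: str,
                       infer: Callable[[str], str]) -> str:
        store = self._stores.setdefault(model_name, OrderedDict())
        if prompt in store:
            store.move_to_end(prompt)  # mark as most recently used
            return store[prompt]
        result = infer(prompt)  # cache miss: run the model
        store[prompt] = result
        if len(store) > self.max_entries:
            store.popitem(last=False)  # evict the least recently used entry
        return result
```

Here infer stands in for whatever call actually runs the model. A repeated prompt for the same model is served from memory, while the same prompt sent to a different model is computed separately.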
Version 1.3.0
New Features
New Model Nomenclature
We've renamed the models in the Model Library for better clarity. Please see the Model Library page for the updated names and what they mean.
GPT Streaming
We added output streaming support for GPT models. Model outputs can now be printed as they are generated for a better user experience.
Performance Dashboard
The NuPIC Performance Dashboard now launches automatically together with the Inference Server. There's no longer a need to launch it separately. The documentation page for this has been updated accordingly. This requires the Docker Compose plugin, which has been added to System Requirements.
Faster and More Secure Connections
You can now connect to the Inference Server using gRPC and HTTPS over SSL. Please see instructions here.
Known Issues
- The nupic-gpt.7b-corti.v0 model requires ~64GB of memory during loading. After loading is complete, the memory footprint sits at ~12GB.
Version 1.2.0
New Features
NuPIC-GPT
We've added the nupic-gpt.zs-7b model, a CPU-optimized GPT model. You can test it out using the GPT Chat and other GPT-related examples.
Bring Your Own Model (BYOM)
The Training Module can now import a model and package it for use in the NuPIC Inference Server. There is an example in examples/byom which imports bert-large-cased. The resulting model can then be copied to the inference/models directory for use in the Inference Server.
BYOM currently works with the following Hugging Face models:
- bert-large-cased
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/multi-qa-mpnet-base-dot-v1
- sentence-transformers/all-mpnet-base-v2
- sentence-transformers/all-roberta-large-v1
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Known Issues
- GPT on CPU on a GPU-enabled machine. The Zephyr, Llama 2, and Gemma models can be configured (config.pbtxt) to run on a CPU or GPU. However, the models fail if configured for CPU on a GPU-enabled machine.
- The NuPIC client library supports client-side tokenization for BERT models, but this does not work for BYOM-imported models. To perform inference with a BYOM model, specify the model name with the -wtokenizer extension.
- The nupic-gpt.zs-7b model uses approximately 11GB when fully loaded, but the process of loading requires 64GB of RAM at its peak. With less than 64GB, the model will fail to load and the client will suggest the model name was incorrect.
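Because a failed load can look like a wrong model name, it may help to check the machine's memory before loading. The sketch below is a hypothetical pre-flight check for Linux, not part of NuPIC. The function names are made up, the 64GB figure is treated as 64 GiB, and the check looks at total physical RAM rather than currently free memory.

```python
import os
from typing import Optional

GIB = 1024 ** 3


def total_ram_bytes() -> int:
    """Total physical memory, read via sysconf (available on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def can_load(required_gib: float, total_bytes: Optional[int] = None) -> bool:
    """Return True if the machine has at least `required_gib` GiB of RAM.

    `total_bytes` can be passed explicitly, for example in tests.
    """
    if total_bytes is None:
        total_bytes = total_ram_bytes()
    return total_bytes >= required_gib * GIB


if __name__ == "__main__":
    # 64 is the peak loading requirement quoted in the known issue above.
    if not can_load(64):
        print("Less than 64 GiB of RAM detected; the model may fail to load.")
```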