Generative Models

Generative models, specifically GPT models, are useful for a wide variety of natural language tasks, and can also provide in-context learning and few-shot capabilities. NuPIC includes instruction-tuned generative models that can provide chat functionality.

Example use cases: semantic understanding, contextual summarization, dialogue generation

Example applications: summarization for news articles, assistant / chatbot, document and knowledge retrieval
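The few-shot capability mentioned above means the model picks up a task from a handful of worked examples placed directly in the prompt, with no fine-tuning. The sketch below only assembles such a prompt as a string and does not call any NuPIC API. The function name `build_few_shot_prompt` and the `Input:` / `Output:` layout are assumptions made for illustration, and an instruction-tuned model may expect its own chat template instead.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble a plain-text few-shot prompt (hypothetical layout).

    Each example is an (input, output) pair shown to the model before
    the real query, so the model can infer the task in context.
    """
    lines = [instruction.strip(), ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final block is left open so the model completes the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        "Classify the sentiment of each review as positive or negative.",
        [
            ("The battery lasts all day.", "positive"),
            ("The screen cracked within a week.", "negative"),
        ],
        "Setup took five minutes and everything worked.",
    )
    print(prompt)
```

The resulting string would be sent as the prompt to whichever generative model is deployed; adding or removing example pairs is the only change needed to move between zero-shot and few-shot use.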


NuPIC-GPT

NuPIC-GPT is our own generative AI model, optimized using neuroscience principles. It performs well on CPUs while being more accurate than Gemma.

Model Card
Input: Text only
Output: Text only
Model Architecture: Auto-regressive transformer model with neuroscience optimizations
Context Length: 4,096 tokens
Throughput: 0.797 sequences per second. Results based on 96 model instances with one thread per model. One sequence = 512 input + 64 output tokens.
Latency: 4.325 seconds per sequence. Results based on a single model instance with 48 threads, serving a single client. One sequence = 512 input + 64 output tokens.
Parameters: 7 billion equivalent
Memory Requirements: At least 12 GB
Training Data: Mix of public datasets, totaling 3–5 trillion tokens depending on variant.
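The throughput and latency figures in these model cards are stated per sequence, where one sequence is 512 input tokens plus 64 output tokens. As a rough aid for reading them, the sketch below converts the NuPIC-GPT numbers into token rates. The helper names are illustrative, not part of NuPIC. The single-client figure is an average over prompt processing and generation combined, so it should not be read as a pure generation speed.

```python
# Token counts per benchmark sequence, as stated in the model card.
INPUT_TOKENS = 512
OUTPUT_TOKENS = 64
TOKENS_PER_SEQUENCE = INPUT_TOKENS + OUTPUT_TOKENS  # 576


def aggregate_tokens_per_second(sequences_per_second: float) -> float:
    """Total tokens (input + output) handled per second across all instances."""
    return sequences_per_second * TOKENS_PER_SEQUENCE


def aggregate_output_tokens_per_second(sequences_per_second: float) -> float:
    """Generated tokens per second across all instances."""
    return sequences_per_second * OUTPUT_TOKENS


def single_client_tokens_per_second(seconds_per_sequence: float) -> float:
    """Average tokens per second seen by one client, prompt and output combined."""
    return TOKENS_PER_SEQUENCE / seconds_per_sequence


if __name__ == "__main__":
    # NuPIC-GPT figures from the model card above.
    print(f"aggregate:     {aggregate_tokens_per_second(0.797):.1f} tokens/s")
    print(f"output only:   {aggregate_output_tokens_per_second(0.797):.1f} tokens/s")
    print(f"single client: {single_client_tokens_per_second(4.325):.1f} tokens/s")
```

For NuPIC-GPT this works out to roughly 459 tokens per second in aggregate across the 96 instances, and about 133 tokens per second for the single-client configuration.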


Gemma

Gemma is a group of lightweight large language models that share technology with the larger Gemini models. Gemma's smaller footprint allows it to run more efficiently and in resource-constrained environments. Please refer to instructions on downloading Gemma to the Model Library.

Model Card
Input: Text only
Output: Text only
Model Architecture: Auto-regressive transformer model with multi-query attention, rotary positional encoding, GeGLU activations and RMSNorm
Context Length: 8,192 tokens
Throughput: 0.838 sequences per second. Results based on 96 model instances with one thread per model, serving 96 concurrent clients. One sequence = 512 input + 64 output tokens.
Latency: 7.657 seconds per sequence. Results based on a single model instance with 64 threads, serving a single client. One sequence = 512 input + 64 output tokens.
Parameters: 2 billion
Memory Requirements: At least 8 GB
Knowledge Cutoff: November 2023

Gemma 2

Gemma 2 brings architectural improvements compared to its predecessor, as well as a new 9B variant trained using knowledge distillation. Please refer to instructions on downloading Gemma 2 to the Model Library.

Model Card
Input: Text only
Output: Text only
Model Architecture: Auto-regressive transformer model with local sliding window and global attention, logit soft-capping, RMSNorm, and grouped-query attention
Context Length: 8,192 tokens
Throughput: 0.838 sequences per second. Results based on 96 model instances with one thread per model, serving 96 concurrent clients. One sequence = 512 input + 64 output tokens.
Latency: 7.657 seconds per sequence. Results based on a single model instance with 64 threads, serving a single client. One sequence = 512 input + 64 output tokens.
Parameters: 9 billion
Memory Requirements: At least 18 GB

Llama 2

Llama 2 is a series of large language models. NuPIC offers the instruction-tuned “chat” variant with 7 billion parameters. Please refer to instructions on downloading Llama 2 to the Model Library.

Model Card
Input: Text only
Output: Text only
Model Architecture: Auto-regressive transformer model
Context Length: 4,096 tokens
Latency: 10.280 seconds per sequence. Results based on a single model instance with 32 threads, serving a single client. One sequence = 512 input + 64 output tokens.
Parameters: 7 billion
Memory Requirements: At least 14 GB
Training Data: A new mix of publicly available online data, totaling 2 trillion tokens
Knowledge Cutoff: September 2022

Llama 3

Llama 3 follows up on Llama 2 with improved performance and an increased vocabulary size. Please refer to instructions on downloading Llama 3 to the Model Library.

Model Card
Input: Text only
Output: Text only
Model Architecture: Auto-regressive transformer model with grouped-query attention, rotary positional encoding and SwiGLU activations
Context Length: 8,192 tokens
Throughput: 0.407 sequences per second. Results based on 92 model instances with one thread per model, serving 92 concurrent clients. One sequence = 512 input + 64 output tokens.
Latency: 6.294 seconds per sequence. Results based on a single model instance with 32 threads, serving a single client. One sequence = 512 input + 64 output tokens.
Parameters: 8 billion
Memory Requirements: At least 16 GB
Training Data: A new mix of publicly available online data, totaling more than 15 trillion tokens
Knowledge Cutoff: March 2023


Zephyr

Zephyr is a fine-tuned version of Mistral-7B-v0.1 that was trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO).

Model Card
Input: Text only
Output: Text only
Model Architecture: Auto-regressive transformer model with grouped-query attention, sliding window attention, and byte-fallback BPE tokenizer
Context Length: 4,096 tokens
Throughput: 0.618 sequences per second. Results based on 96 model instances with one thread per model, serving 96 concurrent clients. One sequence = 512 input + 64 output tokens.
Latency: 33.334 seconds per sequence. Results based on a single model instance with 64 threads, serving a single client. One sequence = 512 input + 64 output tokens.
Parameters: 7 billion
Memory Requirements: At least 28 GB
Training Data: A mix of publicly available and synthetic data, such as web data and technical sources like books and code
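When choosing among the models on this page, the memory requirement and context length are usually the hard constraints, with latency as the tie-breaker. The sketch below encodes the figures from the model cards above and filters them against a memory budget and a minimum context length. The `ModelCard` class and the `candidates` function are hypothetical helpers, not part of NuPIC. The latency figures were measured with different thread counts (32, 48, or 64), so the resulting ordering is only indicative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelCard:
    name: str
    context_tokens: int   # maximum context length, in tokens
    latency_s: float      # seconds per sequence, single client
    min_memory_gb: int    # "at least" memory requirement


# Values copied from the model cards on this page.
CARDS = [
    ModelCard("NuPIC-GPT", 4096, 4.325, 12),
    ModelCard("Gemma", 8192, 7.657, 8),
    ModelCard("Gemma 2", 8192, 7.657, 18),
    ModelCard("Llama 2", 4096, 10.280, 14),
    ModelCard("Llama 3", 8192, 6.294, 16),
    ModelCard("Zephyr", 4096, 33.334, 28),
]


def candidates(memory_gb: int, min_context: int) -> List[ModelCard]:
    """Models that fit the memory budget and context need, fastest first."""
    fits = [
        card
        for card in CARDS
        if card.min_memory_gb <= memory_gb and card.context_tokens >= min_context
    ]
    return sorted(fits, key=lambda card: card.latency_s)


if __name__ == "__main__":
    # Example: a 16 GB budget and a prompt that needs the full 8,192-token window.
    for card in candidates(memory_gb=16, min_context=8192):
        print(f"{card.name}: {card.latency_s} s/sequence, {card.min_memory_gb} GB")
```

With a 16 GB budget and an 8,192-token context requirement, only Llama 3 and Gemma remain, in that order.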