Generative Models
Generative models, specifically GPT-style models, are useful for a wide variety of natural language tasks and also offer in-context learning and few-shot capabilities. NuPIC includes instruction-tuned generative models that provide chat functionality.
Example use cases: semantic understanding, contextual summarization, dialogue generation
Example applications: summarization for news articles, assistant / chatbot, document and knowledge retrieval
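The exact client interface depends on how you deploy NuPIC, so the snippet below is only a minimal sketch of the kind of chat/completion request an instruction-tuned generative model serves. The endpoint URL, payload fields, and request shape are hypothetical placeholders, not the documented NuPIC API; consult your deployment's API reference for the real interface.

```python
# Illustrative only: the endpoint URL, payload fields, and request shape below are
# hypothetical placeholders, not the documented NuPIC API -- consult your
# deployment's API reference for the real interface.
import requests

NUPIC_ENDPOINT = "http://localhost:8000/v1/generate"  # hypothetical endpoint

payload = {
    "model": "nupic-gpt.7b-corti.v0",  # a variant from the model card below
    "prompt": "Summarize the following article in two sentences: ...",
    "max_new_tokens": 64,
}

response = requests.post(NUPIC_ENDPOINT, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```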
NuPIC-GPT
NuPIC-GPT is our own generative AI model, optimized using neuroscience principles. It performs well on CPUs while being more accurate than Gemma.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with neuroscience optimizations |
| Variants | nupic-gpt.7b-corti.v0, nupic-gpt.7b-dendi.v0 |
| Context Length | 4,096 tokens |
| Throughput | 0.797 sequences per second (96 model instances, one thread per model; one sequence = 512 input + 64 output tokens) |
| Latency | 4.325 seconds per sequence (single model instance with 48 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 7 billion equivalent |
| Memory Requirements | At least 12 GB |
| Training Data | Mix of public datasets, totaling 3–5 trillion tokens depending on variant |
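The throughput and latency rows are defined over fixed sequences of 512 input + 64 output tokens. If it helps to think in token rates, the back-of-the-envelope conversion below derives rough generation rates from the quoted NuPIC-GPT figures; these are illustrative estimates, not additional benchmark results.

```python
# Rough token-rate estimates derived from the quoted NuPIC-GPT figures.
# Back-of-the-envelope only, not additional benchmark results.
SEQ_OUTPUT_TOKENS = 64          # one sequence = 512 input + 64 output tokens
throughput_seq_per_s = 0.797    # 96 model instances, one thread per model
latency_s_per_seq = 4.325       # single instance, 48 threads, single client

aggregate_gen_tokens_per_s = throughput_seq_per_s * SEQ_OUTPUT_TOKENS   # ~51 output tokens/s across instances
single_client_gen_tokens_per_s = SEQ_OUTPUT_TOKENS / latency_s_per_seq  # ~14.8 output tokens/s for one client

print(f"aggregate generation rate: ~{aggregate_gen_tokens_per_s:.0f} output tokens/s")
print(f"single-client generation rate: ~{single_client_gen_tokens_per_s:.1f} output tokens/s")
```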
Gemma
Gemma is a family of lightweight large language models that share technology with the larger Gemini models. Gemma's smaller footprint allows it to run efficiently, including in resource-constrained environments. Please refer to instructions on downloading Gemma to the Model Library.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with multi-query attention, rotary positional encoding, GeGLU activations, and RMSNorm |
| Context Length | 8,192 tokens |
| Throughput | 0.838 sequences per second (96 model instances, one thread per model, serving 96 concurrent clients; one sequence = 512 input + 64 output tokens) |
| Latency | 7.657 seconds per sequence (single model instance with 64 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 2 billion |
| Memory Requirements | At least 8 GB |
| Knowledge Cutoff | November 2023 |
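The memory requirement rows throughout this page are minimums for each packaged model. As a rough sanity check, weight memory scales with parameter count times bytes per parameter, plus runtime overhead; the sketch below is an illustrative rule of thumb only, not how these figures were produced.

```python
# Illustrative rule of thumb: weight memory ~ parameters * bytes_per_parameter,
# before activations, KV cache, and runtime overhead. The listed minimums also
# reflect how each model is packaged, so treat this only as a sanity check.
def approx_weight_memory_gb(parameters: float, bytes_per_parameter: int = 2) -> float:
    """Rough weight-memory estimate in GB (2 bytes/param ~ fp16/bf16, 4 ~ fp32)."""
    return parameters * bytes_per_parameter / 1e9

print(approx_weight_memory_gb(7e9))     # ~14 GB for a 7B model at 2 bytes/parameter
print(approx_weight_memory_gb(2e9, 4))  # ~8 GB for a 2B model at 4 bytes/parameter
```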
Gemma 2
Gemma 2 brings architectural improvements compared to its predecessor, as well as a new 9B variant trained using knowledge distillation. Please refer to instructions on downloading Gemma 2 to the Model Library.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with local sliding-window and global attention, logit soft-capping, RMSNorm, and grouped-query attention |
| Context Length | 8,192 tokens |
| Throughput | 0.838 sequences per second (96 model instances, one thread per model, serving 96 concurrent clients; one sequence = 512 input + 64 output tokens) |
| Latency | 7.657 seconds per sequence (single model instance with 64 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 9 billion |
| Memory Requirements | At least 18 GB |
Llama 2
Llama 2 is a series of large language models. NuPIC offers the instruction-tuned “chat” variant with 7 billion parameters. Please refer to instructions on downloading Llama 2 to the Model Library.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model |
| Context Length | 4,096 tokens |
| Latency | 10.280 seconds per sequence (single model instance with 32 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 7 billion |
| Memory Requirements | At least 14 GB |
| Training Data | A new mix of publicly available online data, totaling 2 trillion tokens |
| Knowledge Cutoff | September 2022 |
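If you build prompts for the chat variant by hand rather than through a chat-template helper, the upstream Llama 2 chat models expect an [INST]/<<SYS>> prompt format. The sketch below reflects the upstream template as an assumption; confirm it against the tokenizer configuration shipped with the NuPIC model.

```python
# Sketch of the upstream Llama-2-chat prompt template ([INST] / <<SYS>> markers).
# Confirm against the tokenizer/template shipped with the NuPIC model before use.
def format_llama2_chat(system_prompt: str, user_message: str) -> str:
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(format_llama2_chat("You are a concise assistant.", "Summarize this article: ..."))
```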
Llama 3
Llama 3 follows up on Llama 2 with improved performance and a substantially larger tokenizer vocabulary (roughly 128K tokens). Please refer to instructions on downloading Llama 3 to the Model Library.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with grouped-query attention, rotary positional encoding, and SwiGLU activations |
| Context Length | 8,192 tokens |
| Throughput | 0.407 sequences per second (92 model instances, one thread per model, serving 92 concurrent clients; one sequence = 512 input + 64 output tokens) |
| Latency | 6.294 seconds per sequence (single model instance with 32 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 8 billion |
| Memory Requirements | At least 16 GB |
| Training Data | A new mix of publicly available online data, totaling >15 trillion tokens |
| Knowledge Cutoff | March 2023 |
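Llama 3's instruction-tuned variants use a different, header-based chat template than Llama 2. The sketch below is based on the upstream Llama 3 Instruct format and is an assumption here; confirm it against the tokenizer configuration shipped with the NuPIC model.

```python
# Sketch of the upstream Llama 3 Instruct chat template (header-based markers).
# This is an assumption based on the upstream release -- confirm against the
# tokenizer/template shipped with the NuPIC model before relying on it.
def format_llama3_instruct(system_prompt: str, user_message: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(format_llama3_instruct("You are a concise assistant.", "Summarize this article: ..."))
```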
Zephyr
Zephyr is a fine-tuned version of Mistral-7B-v0.1, trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO).
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with grouped-query attention, sliding-window attention, and a byte-fallback BPE tokenizer |
| Context Length | 4,096 tokens |
| Throughput | 0.618 sequences per second (96 model instances, one thread per model, serving 96 concurrent clients; one sequence = 512 input + 64 output tokens) |
| Latency | 33.334 seconds per sequence (single model instance with 64 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 7 billion |
| Memory Requirements | At least 28 GB |
| Training Data | A mix of publicly available and synthetic data, such as Web data and technical sources like books and code |
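Zephyr uses a ChatML-style prompt format with <|system|>, <|user|>, and <|assistant|> markers. The sketch below reflects the upstream Zephyr template as an assumption; confirm it against the tokenizer configuration shipped with the NuPIC model.

```python
# Sketch of the upstream Zephyr chat format (<|system|>/<|user|>/<|assistant|> markers).
# This is an assumption based on the upstream release -- confirm against the
# tokenizer/template shipped with the NuPIC model before relying on it.
def format_zephyr_chat(system_prompt: str, user_message: str) -> str:
    return (
        f"<|system|>\n{system_prompt}</s>\n"
        f"<|user|>\n{user_message}</s>\n"
        "<|assistant|>\n"
    )

print(format_zephyr_chat("You are a concise assistant.", "Summarize this article: ..."))
```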