Generative Models
Generative models, specifically GPT-style models, are useful for a wide variety of natural language tasks and also offer in-context learning and few-shot capabilities. NuPIC includes instruction-tuned generative models that provide chat functionality.
Example use cases: semantic understanding, contextual summarization, dialogue generation
Example applications: summarization for news articles, assistant / chatbot, document and knowledge retrieval
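The exact client interface depends on how you deploy NuPIC, so the snippet below is only a minimal sketch of the kind of chat/completion request an instruction-tuned generative model serves. The endpoint URL, payload fields, and request shape are hypothetical placeholders, not the documented NuPIC API; consult your deployment's API reference for the real interface.

```python
# Illustrative only: the endpoint URL, payload fields, and request shape below are
# hypothetical placeholders, not the documented NuPIC API -- consult your
# deployment's API reference for the real interface.
import requests

NUPIC_ENDPOINT = "http://localhost:8000/v1/generate"  # hypothetical endpoint

payload = {
    "model": "nupic-gpt.7b-corti.v0",  # a variant from the model card below
    "prompt": "Summarize the following article in two sentences: ...",
    "max_new_tokens": 64,
}

response = requests.post(NUPIC_ENDPOINT, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```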
NuPIC-GPT
NuPIC-GPT is our own generative AI model, optimized using neuroscience principles. It performs well on CPUs while being more accurate than Gemma.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with neuroscience optimizations |
| Variants | nupic-gpt.7b-corti.v0, nupic-gpt.7b-dendi.v0 |
| Context Length | 4,096 tokens |
| Throughput | 0.797 sequences per second (96 model instances, one thread per model; one sequence = 512 input + 64 output tokens) |
| Latency | 4.325 seconds per sequence (single model instance with 48 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 7 billion equivalent |
| Memory Requirements | At least 12 GB |
| Training Data | Mix of public datasets, totaling 3–5 trillion tokens depending on variant |
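The throughput and latency rows are defined over fixed sequences of 512 input + 64 output tokens. If it helps to think in token rates, the back-of-the-envelope conversion below derives rough generation rates from the quoted NuPIC-GPT figures; these are illustrative estimates, not additional benchmark results.

```python
# Rough token-rate estimates derived from the quoted NuPIC-GPT figures.
# Back-of-the-envelope only, not additional benchmark results.
SEQ_OUTPUT_TOKENS = 64          # one sequence = 512 input + 64 output tokens
throughput_seq_per_s = 0.797    # 96 model instances, one thread per model
latency_s_per_seq = 4.325       # single instance, 48 threads, single client

aggregate_gen_tokens_per_s = throughput_seq_per_s * SEQ_OUTPUT_TOKENS   # ~51 output tokens/s across instances
single_client_gen_tokens_per_s = SEQ_OUTPUT_TOKENS / latency_s_per_seq  # ~14.8 output tokens/s for one client

print(f"aggregate generation rate: ~{aggregate_gen_tokens_per_s:.0f} output tokens/s")
print(f"single-client generation rate: ~{single_client_gen_tokens_per_s:.1f} output tokens/s")
```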
Gemma
Gemma is a family of lightweight large language models that share technology with the larger Gemini models. Gemma's smaller footprint allows it to run efficiently, including in resource-constrained environments. Please refer to instructions on downloading Gemma to the Model Library.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with multi-query attention, rotary positional encoding, GeGLU activations, and RMSNorm |
| Context Length | 8,192 tokens |
| Throughput | 0.838 sequences per second (96 model instances, one thread per model, serving 96 concurrent clients; one sequence = 512 input + 64 output tokens) |
| Latency | 7.657 seconds per sequence (single model instance with 64 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 2 billion |
| Memory Requirements | At least 8 GB |
| Knowledge Cutoff | November 2023 |
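The memory requirement rows throughout this page are minimums for each packaged model. As a rough sanity check, weight memory scales with parameter count times bytes per parameter, plus runtime overhead; the sketch below is an illustrative rule of thumb only, not how these figures were produced.

```python
# Illustrative rule of thumb: weight memory ~ parameters * bytes_per_parameter,
# before activations, KV cache, and runtime overhead. The listed minimums also
# reflect how each model is packaged, so treat this only as a sanity check.
def approx_weight_memory_gb(parameters: float, bytes_per_parameter: int = 2) -> float:
    """Rough weight-memory estimate in GB (2 bytes/param ~ fp16/bf16, 4 ~ fp32)."""
    return parameters * bytes_per_parameter / 1e9

print(approx_weight_memory_gb(7e9))     # ~14 GB for a 7B model at 2 bytes/parameter
print(approx_weight_memory_gb(2e9, 4))  # ~8 GB for a 2B model at 4 bytes/parameter
```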
Gemma 2
Gemma 2 brings architectural improvements compared to its predecessor, as well as a new 9B variant trained using knowledge distillation. Please refer to instructions on downloading Gemma 2 to the Model Library.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with local sliding-window and global attention, logit soft-capping, RMSNorm, and grouped-query attention |
| Context Length | 8,192 tokens |
| Throughput | 0.838 sequences per second (96 model instances, one thread per model, serving 96 concurrent clients; one sequence = 512 input + 64 output tokens) |
| Latency | 7.657 seconds per sequence (single model instance with 64 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 9 billion |
| Memory Requirements | At least 18 GB |
Llama 2
Llama 2 is a series of large language models. NuPIC offers the instruction-tuned “chat” variant with 7 billion parameters. Please refer to instructions on downloading Llama 2 to the Model Library.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model |
| Context Length | 4,096 tokens |
| Latency | 10.280 seconds per sequence (single model instance with 32 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 7 billion |
| Memory Requirements | At least 14 GB |
| Training Data | A new mix of publicly available online data, totaling 2 trillion tokens |
| Knowledge Cutoff | September 2022 |
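If you build prompts for the chat variant by hand rather than through a chat-template helper, the upstream Llama 2 chat models expect an [INST]/<<SYS>> prompt format. The sketch below reflects the upstream template as an assumption; confirm it against the tokenizer configuration shipped with the NuPIC model.

```python
# Sketch of the upstream Llama-2-chat prompt template ([INST] / <<SYS>> markers).
# Confirm against the tokenizer/template shipped with the NuPIC model before use.
def format_llama2_chat(system_prompt: str, user_message: str) -> str:
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(format_llama2_chat("You are a concise assistant.", "Summarize this article: ..."))
```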
Llama 3
Llama 3 follows up on Llama 2 with improved performance and a substantially larger tokenizer vocabulary (roughly 128K tokens). Please refer to instructions on downloading Llama 3 to the Model Library.
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with grouped-query attention, rotary positional encoding, and SwiGLU activations |
| Context Length | 8,192 tokens |
| Throughput | 0.407 sequences per second (92 model instances, one thread per model, serving 92 concurrent clients; one sequence = 512 input + 64 output tokens) |
| Latency | 6.294 seconds per sequence (single model instance with 32 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 8 billion |
| Memory Requirements | At least 16 GB |
| Training Data | A new mix of publicly available online data, totaling >15 trillion tokens |
| Knowledge Cutoff | March 2023 |
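Llama 3's instruction-tuned variants use a different, header-based chat template than Llama 2. The sketch below is based on the upstream Llama 3 Instruct format and is an assumption here; confirm it against the tokenizer configuration shipped with the NuPIC model.

```python
# Sketch of the upstream Llama 3 Instruct chat template (header-based markers).
# This is an assumption based on the upstream release -- confirm against the
# tokenizer/template shipped with the NuPIC model before relying on it.
def format_llama3_instruct(system_prompt: str, user_message: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(format_llama3_instruct("You are a concise assistant.", "Summarize this article: ..."))
```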
Zephyr
Zephyr is a fine-tuned version of Mistral-7B-v0.1, trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO).
| Model Card | |
| --- | --- |
| Input | Text only |
| Output | Text only |
| Model Architecture | Auto-regressive transformer model with grouped-query attention, sliding-window attention, and a byte-fallback BPE tokenizer |
| Context Length | 4,096 tokens |
| Throughput | 0.618 sequences per second (96 model instances, one thread per model, serving 96 concurrent clients; one sequence = 512 input + 64 output tokens) |
| Latency | 33.334 seconds per sequence (single model instance with 64 threads, serving a single client; one sequence = 512 input + 64 output tokens) |
| Parameters | 7 billion |
| Memory Requirements | At least 28 GB |
| Training Data | A mix of publicly available and synthetic data, such as Web data and technical sources like books and code |
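Zephyr uses a ChatML-style prompt format with <|system|>, <|user|>, and <|assistant|> markers. The sketch below reflects the upstream Zephyr template as an assumption; confirm it against the tokenizer configuration shipped with the NuPIC model.

```python
# Sketch of the upstream Zephyr chat format (<|system|>/<|user|>/<|assistant|> markers).
# This is an assumption based on the upstream release -- confirm against the
# tokenizer/template shipped with the NuPIC model before relying on it.
def format_zephyr_chat(system_prompt: str, user_message: str) -> str:
    return (
        f"<|system|>\n{system_prompt}</s>\n"
        f"<|user|>\n{user_message}</s>\n"
        "<|assistant|>\n"
    )

print(format_zephyr_chat("You are a concise assistant.", "Summarize this article: ..."))
```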