CodeBase-280B: Next-Gen MoE LLM Launched
🚀 Introducing CodeBase-280B: Phase 1
Hey everyone,
We're excited to share the first phase of CodeBase-280B, our next-generation language model built for performance, scalability, and advanced AI capabilities. Here's what makes it special:
💡 Key Highlights
- Mixture of Experts (MoE): 128 experts with 9 active per token, so each token activates only a small fraction of the total parameters, cutting per-token compute while preserving capacity.
- Massive Context Window: Handles up to 384,000 tokens at once, allowing it to understand extremely long documents or conversations.
- Compressed Attention: Optimized memory usage with partial KV sharing to keep inference fast.
- Efficient Inference: 8-bit quantization with KV cache compression reduces memory requirements without sacrificing quality.
- Parallel & Distributed: Designed for multi-GPU setups with support for distributed training and mixed precision (bfloat16) for optimal performance.
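To make the MoE point concrete, here is a minimal numpy sketch of top-k expert routing with the numbers above (128 experts, 9 active per token). The router design shown, a linear gate followed by softmax over the selected experts, is a common pattern, not a confirmed detail of CodeBase-280B, and the names `topk_moe_route` and `gate_w` are illustrative:

```python
import numpy as np

def topk_moe_route(x, gate_w, num_active=9):
    """Route one token's hidden state to the top-k of the available experts.

    x: (hidden,) token representation
    gate_w: (hidden, num_experts) router weights
    Returns (expert_indices, normalized_weights).
    """
    logits = x @ gate_w                      # (num_experts,) router scores
    topk = np.argsort(logits)[-num_active:]  # indices of the k highest scores
    # Softmax over the selected experts only, as in standard top-k routing
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()
    return topk, w

rng = np.random.default_rng(0)
hidden, num_experts = 7168, 128
x = rng.standard_normal(hidden).astype(np.float32)
gate_w = rng.standard_normal((hidden, num_experts)).astype(np.float32)
idx, w = topk_moe_route(x, gate_w)
print(len(idx), round(float(w.sum()), 6))  # 9 experts selected; weights sum to 1
```

Only the 9 selected experts run their feed-forward pass for that token; the other 119 are skipped entirely, which is where the compute savings come from.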
📊 Specs at a Glance
- Hidden Size: 7,168
- Layers: 75
- Attention Heads: 52
- Experts: 128 total, 9 active
- Context Window: 384,000 tokens
- Parameters: ~280B total (~18B active at a time)
- Vocabulary: 51,200 tokens
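The specs above also explain why KV-cache compression matters at a 384,000-token context. As a back-of-envelope estimate (my assumption: one key and one value vector of size `hidden` per layer per token, i.e. plain multi-head attention with no KV sharing, which is exactly the baseline the compressed attention avoids):

```python
# Rough upper bound on KV-cache memory at the full 384K context,
# assuming NO KV sharing or compression (worst-case baseline).
layers, hidden, seq_len = 75, 7168, 384_000

def kv_cache_gib(bytes_per_value):
    # 2x for keys and values; one vector of size `hidden` per layer per token
    return 2 * layers * hidden * seq_len * bytes_per_value / 2**30

bf16 = kv_cache_gib(2)   # 16-bit cache
int8 = kv_cache_gib(1)   # 8-bit compressed cache
print(f"bf16: {bf16:.0f} GiB, int8: {int8:.0f} GiB")
```

Even with 8-bit values the uncompressed cache would be hundreds of GiB, so partial KV sharing plus cache compression is what makes the long context practical on real hardware.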
🛠️ Project Structure & Usage
The CodeBase-280B repository includes everything you need to train, evaluate, and run inference on the model:
- Transformer architecture and MoE modules
- Compressed attention and RoPE positional encoding
- Quantization utilities for memory-efficient inference
- Training scripts with multi-GPU support
- Open-source configuration for customization
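For readers unfamiliar with RoPE (mentioned in the list above), here is a minimal numpy sketch of rotary positional encoding. The base of 10000 and the half-split channel layout are common defaults, not confirmed details of CodeBase-280B's implementation:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, dim).

    Channel pairs (i, i + dim/2) are rotated by angle pos * base^(-2i/dim),
    encoding absolute position as a rotation in each 2D plane.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # (half,) per-plane frequencies
    angles = positions[:, None] * freqs[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Key property: query-key dot products depend only on the position OFFSET,
# so attention scores are relative even though the rotation is absolute.
q, k = np.ones((1, 64)), np.ones((1, 64))
a = rope(q, np.array([5])) @ rope(k, np.array([3])).T    # offset 2
b = rope(q, np.array([12])) @ rope(k, np.array([10])).T  # offset 2
print(np.allclose(a, b))  # True
```

This relative-offset property is what lets rotary encodings scale to long contexts like the 384K window here.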
Installation is simple: clone the repo and install dependencies via pip. You can train, generate text, benchmark, or run tests with the provided scripts.