

Forge is a swarm-based kernel optimizer that accelerates GPU inference for any model. Enter a HuggingFace model ID and Forge automatically generates optimized CUDA/Triton kernels for every layer.
The system runs 32 Coder+Judge agent pairs in parallel, competing to find the fastest implementation of each kernel. Each Coder explores optimization strategies including tensor core utilization, memory coalescing, and kernel fusion, while its Judge verifies correctness and measures speed. The result is up to a 5× speedup over torch.compile(mode='max-autotune') with a 97.6% kernel correctness rate.
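A minimal sketch of what the Judge half of a pair might do, assuming it validates each candidate against a reference implementation and only times kernels that pass; the function shape and tolerances are illustrative assumptions, not Forge's actual harness:

```python
import torch

def judge(candidate, reference, inputs, rtol=1e-2, atol=1e-2, iters=100):
    # Correctness gate: a kernel that diverges from the reference
    # output is rejected outright and never timed.
    out_ref = reference(*inputs)
    out_new = candidate(*inputs)
    if not torch.allclose(out_ref, out_new, rtol=rtol, atol=atol):
        return None
    # Fitness: mean latency per call, measured with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        candidate(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```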
Forge uses inference-time scaling powered by a fine-tuned, inference-optimized NVIDIA Nemotron 3 Nano 30B generating 250k tokens/second, enabling deep exploration of the optimization space in minutes instead of hours. The search itself is evolutionary: a MAP-Elites archive of 36 cells preserves diverse high-performing kernels, and an island model with 4 specialized populations exchanges candidates through migration.
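As a rough illustration of those two mechanisms, the sketch below keeps one elite kernel per cell of a 6×6 (= 36-cell) grid and migrates the top elites between 4 islands in a ring; the behavior descriptors (tile-size and occupancy buckets) are invented for illustration and are not Forge's actual features:

```python
GRID = 6  # 6 x 6 = 36 cells, matching the archive size above

def descriptor(kernel):
    # Map a kernel to a behavior cell; the two axes are assumptions.
    return (kernel["tile"] % GRID, kernel["occupancy"] % GRID)

def try_insert(archive, kernel, fitness):
    # MAP-Elites rule: each cell keeps only its best (fastest) kernel.
    cell = descriptor(kernel)
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, kernel)

# Island model: 4 populations evolve independently and periodically
# pass their best elites around a ring so strong traits spread.
islands = [dict() for _ in range(4)]

def migrate(islands, k=3):
    for src, dst in zip(islands, islands[1:] + islands[:1]):
        elites = sorted(src.values(), key=lambda e: e[0], reverse=True)[:k]
        for fitness, kernel in elites:
            try_insert(dst, kernel, fitness)
```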
Forge achieves significant performance improvements across a range of models:

- Llama-3.1-8B: 5.16× faster
- Qwen2.5-7B: 4.23× faster
- Mistral-7B: 3.38× faster
- Phi-3-mini: 2.75× faster
- SDXL UNet: 2.87× faster
- Whisper-large: 2.63× faster
- BERT-large: 2.43× faster
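For context, here is a minimal sketch of how a speedup against the torch.compile(mode='max-autotune') baseline can be measured; the stand-in Linear layer and benchmark parameters are assumptions, not Forge's evaluation harness:

```python
import time
import torch

def bench(fn, *args, warmup=10, iters=100):
    # Warm-up excludes compilation/autotuning cost from the measurement.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters  # seconds per call

layer = torch.nn.Linear(4096, 4096).cuda().half()
x = torch.randn(8, 4096, device="cuda", dtype=torch.half)

baseline = torch.compile(layer, mode="max-autotune")
t_baseline = bench(baseline, x)
# With a Forge-generated kernel `forge_layer` (hypothetical), the
# reported number would be: speedup = t_baseline / bench(forge_layer, x)
print(f"baseline: {t_baseline * 1e6:.1f} us/call")
```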
Forge is designed for developers working with GPU-accelerated deep learning models who need optimized inference performance. It supports inputs from PyTorch (any nn.Module or function), KernelBench (250+ benchmark tasks), and HuggingFace (any model ID).
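A hypothetical front-end sketch of those three entry points; the `forge` module, its function names, and the KernelBench task string are assumed for illustration and are not a documented API:

```python
import torch
# import forge  # hypothetical package name

net = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU())

# 1. Any PyTorch nn.Module or function (hypothetical call):
# fast_net = forge.optimize(net.cuda())

# 2. One of the 250+ KernelBench tasks (task string is invented):
# fast_kernel = forge.optimize("kernelbench:softmax")

# 3. Any HuggingFace model ID (hypothetical call):
# fast_model = forge.optimize("meta-llama/Llama-3.1-8B")
```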
It targets users who work with PyTorch models and HuggingFace model IDs and who need high-performance GPU kernels for models such as Llama, Mistral, Qwen, Phi-3, SDXL, Whisper, and BERT. Forge serves developers who need to outperform torch.compile(mode='max-autotune') with specialized optimizations for datacenter GPUs, including the B200, H100, and H200.