

Forge is a swarm-based kernel optimizer that accelerates GPU inference for any model. Enter a HuggingFace model ID and Forge automatically generates optimized CUDA/Triton kernels for every layer.
The system runs 32 Coder+Judge agent pairs in parallel, competing to find the fastest implementation of each kernel. Each Coder explores optimization strategies including tensor core utilization, memory coalescing, and kernel fusion, while its Judge verifies correctness and measures speed. The result is up to a 5× speedup over torch.compile(mode='max-autotune') with a 97.6% kernel correctness rate.
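A minimal sketch of what the Judge half of a pair might do, assuming it validates each candidate against a reference implementation and only times kernels that pass; the function shape and tolerances are illustrative assumptions, not Forge's actual harness:

```python
import torch

def judge(candidate, reference, inputs, rtol=1e-2, atol=1e-2, iters=100):
    # Correctness gate: a kernel that diverges from the reference
    # output is rejected outright and never timed.
    out_ref = reference(*inputs)
    out_new = candidate(*inputs)
    if not torch.allclose(out_ref, out_new, rtol=rtol, atol=atol):
        return None
    # Fitness: mean latency per call, measured with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        candidate(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```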
Forge uses inference-time scaling powered by a fine-tuned, inference-optimized NVIDIA Nemotron 3 Nano 30B generating 250k tokens/second, enabling deep exploration of the optimization space in minutes instead of hours. The search itself is evolutionary: a MAP-Elites archive of 36 cells preserves diverse high-performing kernels, and an island model with 4 specialized populations exchanges candidates through migration.
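As a rough illustration of those two mechanisms, the sketch below keeps one elite kernel per cell of a 6×6 (= 36-cell) grid and migrates the top elites between 4 islands in a ring; the behavior descriptors (tile-size and occupancy buckets) are invented for illustration and are not Forge's actual features:

```python
GRID = 6  # 6 x 6 = 36 cells, matching the archive size above

def descriptor(kernel):
    # Map a kernel to a behavior cell; the two axes are assumptions.
    return (kernel["tile"] % GRID, kernel["occupancy"] % GRID)

def try_insert(archive, kernel, fitness):
    # MAP-Elites rule: each cell keeps only its best (fastest) kernel.
    cell = descriptor(kernel)
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, kernel)

# Island model: 4 populations evolve independently and periodically
# pass their best elites around a ring so strong traits spread.
islands = [dict() for _ in range(4)]

def migrate(islands, k=3):
    for src, dst in zip(islands, islands[1:] + islands[:1]):
        elites = sorted(src.values(), key=lambda e: e[0], reverse=True)[:k]
        for fitness, kernel in elites:
            try_insert(dst, kernel, fitness)
```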
Forge achieves significant performance improvements across a range of models:

- Llama-3.1-8B: 5.16× faster
- Qwen2.5-7B: 4.23× faster
- Mistral-7B: 3.38× faster
- Phi-3-mini: 2.75× faster
- SDXL UNet: 2.87× faster
- Whisper-large: 2.63× faster
- BERT-large: 2.43× faster
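For context, here is a minimal sketch of how a speedup against the torch.compile(mode='max-autotune') baseline can be measured; the stand-in Linear layer and benchmark parameters are assumptions, not Forge's evaluation harness:

```python
import time
import torch

def bench(fn, *args, warmup=10, iters=100):
    # Warm-up excludes compilation/autotuning cost from the measurement.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters  # seconds per call

layer = torch.nn.Linear(4096, 4096).cuda().half()
x = torch.randn(8, 4096, device="cuda", dtype=torch.half)

baseline = torch.compile(layer, mode="max-autotune")
t_baseline = bench(baseline, x)
# With a Forge-generated kernel `forge_layer` (hypothetical), the
# reported number would be: speedup = t_baseline / bench(forge_layer, x)
print(f"baseline: {t_baseline * 1e6:.1f} us/call")
```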
Forge is designed for developers working with GPU-accelerated deep learning models who need optimized inference performance. It supports inputs from PyTorch (any nn.Module or function), KernelBench (250+ benchmark tasks), and HuggingFace (any model ID).
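A hypothetical front-end sketch of those three entry points; the `forge` module, its function names, and the KernelBench task string are assumed for illustration and are not a documented API:

```python
import torch
# import forge  # hypothetical package name

net = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU())

# 1. Any PyTorch nn.Module or function (hypothetical call):
# fast_net = forge.optimize(net.cuda())

# 2. One of the 250+ KernelBench tasks (task string is invented):
# fast_kernel = forge.optimize("kernelbench:softmax")

# 3. Any HuggingFace model ID (hypothetical call):
# fast_model = forge.optimize("meta-llama/Llama-3.1-8B")
```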
It targets users who work with PyTorch models and HuggingFace model IDs and who need high-performance GPU kernels for models such as Llama, Mistral, Qwen, Phi-3, SDXL, Whisper, and BERT. Forge serves developers who need to outperform torch.compile(mode='max-autotune') with specialized optimizations for datacenter GPUs, including the B200, H100, and H200.