Researchers at Unsloth, collaborating with NVIDIA, made Large Language Model (LLM) training roughly 25% faster through three key changes. The first optimization caches metadata and mask structures so they can be reused across transformer layers instead of being rebuilt on every call, saving approximately 206-370 milliseconds per step on larger models.
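The article does not show the actual code, but the idea can be illustrated with a minimal sketch: memoize a causal attention mask keyed by shape, device, and dtype, so every layer and step after the first reads it from a cache. The function names here are illustrative, not Unsloth's implementation.

```python
# Hypothetical sketch of the mask-caching idea: build each causal mask once
# per (seq_len, device, dtype) and serve it from a cache thereafter, rather
# than reconstructing it in every attention call of every layer.
from functools import lru_cache

import torch

@lru_cache(maxsize=8)
def cached_causal_mask(seq_len: int, device: str, dtype: torch.dtype) -> torch.Tensor:
    # Upper-triangular -inf mask; computed only on the first call for a
    # given key, then reused across all later layers and steps.
    mask = torch.full((seq_len, seq_len), float("-inf"), device=device, dtype=dtype)
    return torch.triu(mask, diagonal=1)

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # q, k: (batch, heads, seq_len, head_dim)
    seq_len = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores + cached_causal_mask(seq_len, str(q.device), q.dtype)
    return torch.softmax(scores, dim=-1)
```

Because the cache key only changes when the sequence length, device, or dtype changes, the per-step cost of mask construction drops to near zero after warmup.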

The second change uses double buffering to overlap activation reloads with computation: while the backward pass for one layer runs, the checkpointed activations needed by the next layer are reloaded onto the GPU, hiding the reload latency. The size of this win scales with batch size, sequence length, and other factors that affect the balance between data movement and computation.
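A minimal sketch of the pattern follows, assuming the activation-offloading variant of gradient checkpointing (inputs offloaded to pinned CPU memory, forward recomputed per layer during backward). All names (`prefetch`, `backward_with_double_buffering`, `cpu_acts`) are hypothetical, not Unsloth's code.

```python
# Sketch: while the backward pass of layer i runs on the default stream,
# the offloaded input of layer i-1 is copied back on a side stream, so the
# host-to-device copy overlaps with useful compute.
import torch

copy_stream = torch.cuda.Stream()

def prefetch(cpu_act: torch.Tensor):
    # Launch the host-to-device copy on the side stream; cpu_act must live
    # in pinned memory for the copy to be truly asynchronous.
    with torch.cuda.stream(copy_stream):
        gpu_act = cpu_act.to("cuda", non_blocking=True)
        done = torch.cuda.Event()
        done.record()  # records on copy_stream (the current stream here)
    return gpu_act, done

def backward_with_double_buffering(layers, cpu_acts, grad_out):
    # cpu_acts[i] holds the pinned, offloaded input of layers[i].
    act, ready = prefetch(cpu_acts[-1])
    for i in reversed(range(len(layers))):
        if i > 0:
            # Start reloading the previous layer's input right away so the
            # copy overlaps with this layer's recompute and backward.
            nxt, nxt_ready = prefetch(cpu_acts[i - 1])
        torch.cuda.current_stream().wait_event(ready)  # wait for this copy only
        inp = act.detach().requires_grad_(True)
        out = layers[i](inp)    # recompute the forward for this layer
        out.backward(grad_out)  # backward runs while the next copy proceeds
        grad_out = inp.grad
        if i > 0:
            act, ready = nxt, nxt_ready
    return grad_out
```

Production code would also need `record_stream()` calls (or equivalent) so the caching allocator does not reuse a prefetched buffer while the compute stream still holds it; the sketch omits that for brevity.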

The third optimization targets Mixture-of-Experts (MoE) routing in PyTorch-based GPT-OSS implementations. Grouping tokens by expert and reusing a single offsets tensor, rather than issuing a dynamic token-list query for each expert, cuts the number of dynamic queries per dispatch from num_experts to roughly one. This change yields 10-15% speedups on GPT-OSS configurations.
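A sketch of what this could look like, under the assumption that the baseline issues one `nonzero()`-style query per expert (each of which forces a host-device sync): sort tokens by expert once, derive all per-expert slice boundaries from one cumulative-sum offsets tensor, and sync with the host a single time. The function name `moe_dispatch` is illustrative.

```python
# Sketch: one sort + one offsets tensor replaces num_experts dynamic
# token-list queries when dispatching tokens to MoE experts.
import torch

def moe_dispatch(tokens: torch.Tensor, expert_ids: torch.Tensor, experts):
    # tokens: (num_tokens, hidden); expert_ids: (num_tokens,) from the router.
    order = torch.argsort(expert_ids)          # group tokens by expert
    sorted_tokens = tokens[order]
    counts = torch.bincount(expert_ids, minlength=len(experts))
    offsets = torch.cumsum(counts, dim=0)
    offsets_cpu = offsets.tolist()             # the single host sync per dispatch
    out = torch.empty_like(sorted_tokens)
    start = 0
    for e, expert in enumerate(experts):
        end = offsets_cpu[e]
        if end > start:                        # skip experts with no tokens
            out[start:end] = expert(sorted_tokens[start:end])
        start = end
    # Scatter results back to the original token order.
    unsorted = torch.empty_like(out)
    unsorted[order] = out
    return unsorted
```

The per-expert `(expert_ids == e).nonzero()` pattern it replaces synchronizes once per expert because the result's length is data-dependent; here only the `tolist()` call crosses the host-device boundary.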

Although these optimizations touch different parts of the stack, they attack the same underlying problem: as the main kernels get faster, the glue code around them accounts for a growing share of step time. The improvements compose conceptually, with each one addressing a distinct pain point in LLM training.