Fine-Tuning GLM-4.7-Flash (30B MoE) with LoRA on a Single 48GB GPU
TL;DR
I fine-tuned a 30B-parameter model (GLM-4.7-Flash) on a single 48GB GPU, despite the conventional wisdom that you need 60GB+. The trick: 80% of the model’s weights are stored in a format that no quantization library can touch, so you offload them to CPU RAM instead. Getting there took four monkey-patches to keep the training stack from crashing, a custom autograd hook to stop PyTorch from eating all your VRAM, and about 9 hours of training per run. Three runs produced broken models before I figured out that you can’t just train the attention layers on an MoE model; you need the shared-expert FFN layers too. The working pipeline uses 30GB of VRAM and ~114GB of RAM. If you have a …
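The shared-expert lesson boils down to which modules you hand to LoRA. As a minimal sketch (the module names below are hypothetical, modeled on a typical MoE transformer layout; inspect `model.named_modules()` on the real checkpoint to confirm), the idea is to target the attention projections and the shared-expert FFN while leaving the routed experts frozen:

```python
import re

# Hypothetical parameter names for one MoE transformer layer.
# Real GLM module names will differ; check model.named_modules().
param_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.shared_experts.gate_proj",
    "model.layers.0.mlp.shared_experts.up_proj",
    "model.layers.0.mlp.shared_experts.down_proj",
    "model.layers.0.mlp.experts.17.gate_proj",  # routed expert: stays frozen
]

# Match attention projections AND the shared-expert FFN, but not the
# routed experts (whose weights stay frozen and offloaded on CPU).
pattern = re.compile(
    r"(self_attn\.(q|k|v|o)_proj|shared_experts\.(gate|up|down)_proj)$"
)
targets = [n for n in param_names if pattern.search(n)]
```

The resulting `targets` list is what you would pass as LoRA target modules; the routed expert (`experts.17`) is deliberately excluded, since those weights live in CPU RAM and never receive gradients.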