Fine-Tuning GLM-4.7-Flash (30B MoE) with LoRA on a Single 48GB GPU

By Mădălin Dogaru

February 25, 2025

TL;DR

I fine-tuned a 30B parameter model (GLM-4.7-Flash) on a single 48GB GPU. Everyone said you need 60GB+. The trick: 80% of the model’s weights are stored in a format that no quantization library can touch, so you offload them to CPU RAM instead. It took four monkey-patches to keep the training stack from crashing, a custom autograd hook to stop PyTorch from eating all your VRAM, and about 9 hours of training per run. Three runs produced broken models before I figured out that you can’t train only the attention layers on an MoE model; you need the shared expert FFN layers too. The working pipeline uses 30GB VRAM and ~114GB RAM. If you have a 48GB GPU and 128GB+ of system memory, you can do this.

Abstract

This whole thing started, naively, as a quick check of how fine-tuning a model for offensive security operations affects its capabilities, but it morphed into research on how to fine-tune a model that does not fit in your GPU's VRAM. These are my documented steps with no details held back, so this is not for everyone.

GLM-4.7-Flash is a 30 billion parameter MoE language model with 64 experts per layer and roughly 3 billion active parameters per forward pass. The fused 3D expert weight tensors cannot be quantized by BitsAndBytes, which means 80% of the model stays in full bf16 no matter what quantization settings you throw at it. 48GB is not enough. Except it is, if you’re willing to go deep. The approach I landed on combines 8-bit quantization with CPU offloading, four monkey-patches to the transformers/accelerate/bitsandbytes stack, and a custom autograd tensor management strategy using PyTorch’s saved_tensors_hooks API. Training holds stable at 30GB VRAM with 18GB of headroom. I could not find any prior documented case of LoRA fine-tuning this model on hardware with less than 60GB of VRAM so I had to do it myself.

1. Hardware Configuration

Component	Specification
GPU	NVIDIA RTX PRO 5000 Blackwell, 48GB GDDR7, PCIe 5.0 x16
CPU	AMD Ryzen 9 9950X, 16 cores / 32 threads, Zen 5
Motherboard	ASUS ProArt X870E-CREATOR WIFI
RAM	192GB DDR5 (186GB usable)
Storage	Samsung 970 NVMe SSD
OS	Ubuntu 24.04.3 LTS, kernel 6.17.0-14-generic
NVIDIA Driver	590.48.01 (nvidia-driver-open, open kernel module)

System specs — AMD Ryzen 9 9950X, 192GB DDR5, NVIDIA RTX PRO 5000 Blackwell

nvidia-smi output showing RTX PRO 5000 Blackwell (48GB GDDR7) at idle

The CPU-offloaded portion of the model (~28GB) has to live in RAM as real tensors, not memory-mapped files, because it gets shuttled to and from the GPU on every forward and backward pass. In practice you probably need at least 120GB of RAM to run this without degrading the rest of the system.
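A quick preflight check can save a failed run. The sketch below reads total physical memory through os.sysconf (available on Linux, which is what this setup uses) and compares it with the ~120GB estimate above. The function names and the default threshold are illustrative and not part of the training script.

```python
import os

def total_ram_gib() -> float:
    """Total physical RAM in GiB, read through POSIX sysconf."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 2**30

def has_enough_ram(required_gib: float = 120.0) -> bool:
    """True if the machine has at least `required_gib` of physical RAM."""
    return total_ram_gib() >= required_gib

if __name__ == "__main__":
    print(f"total RAM: {total_ram_gib():.0f} GiB, enough: {has_enough_ram()}")
```

This only checks installed memory, not what is free at the moment training starts.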

2. Software Stack

Package	Version
Python	3.12
PyTorch	2.10.0+cu128
transformers	5.3.0.dev0
accelerate	1.13.0.dev0
bitsandbytes	0.49.2
peft	0.18.1
trl	0.24.0

These versions matter. The patches below target bugs and behaviors in this exact stack. Newer releases might fix some of these, or break things in new ways.

3. The MoE Quantization Problem

3.1 Why Standard Quantization Fails

BitsAndBytes quantization (the backbone of QLoRA) works on nn.Linear layers. It takes 2D weight matrices and replaces them with quantized representations (Int8 or NF4).

GLM-4.7-Flash does not store its expert weights as nn.Linear. They are fused 3D nn.Parameter tensors:

experts.gate_up_proj: shape [64, 3072, 2048], dtype bfloat16
experts.down_proj:    shape [64, 2048, 1536], dtype bfloat16

BitsAndBytes does not see these. Does not touch them. With 46 MoE layers (layers 1-46; layer 0 is dense), the expert weights make up roughly 80% of the total model parameters.
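As a back-of-the-envelope check, the expert parameter count can be derived from the two shapes listed above. This sketch assumes every one of the 46 MoE layers carries exactly those two tensors and uses the total parameter count quoted in the LoRA configuration section. With the shapes exactly as listed, the share comes out above the rough 80% figure, so read that number as an approximation; either way the experts dominate the model.

```python
from math import prod

# Shapes as listed above (per MoE layer); 46 MoE layers, layer 0 is dense.
GATE_UP_SHAPE = (64, 3072, 2048)
DOWN_SHAPE = (64, 2048, 1536)
MOE_LAYERS = 46
TOTAL_PARAMS = 29_964_422_912  # total quoted in the LoRA configuration section
BYTES_BF16 = 2

expert_params = MOE_LAYERS * (prod(GATE_UP_SHAPE) + prod(DOWN_SHAPE))
expert_share = expert_params / TOTAL_PARAMS
expert_gb = expert_params * BYTES_BF16 / 1e9

print(f"expert parameters: {expert_params:,}")    # 27,783,069,696
print(f"share of model:    {expert_share:.1%}")   # 92.7%
print(f"size in bf16:      {expert_gb:.1f} GB")   # 55.6 GB
```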

3.2 Measured VRAM Usage

Quantization Mode	Expected VRAM	Measured VRAM	Notes
Full bf16	~60GB	~60GB	Does not fit
8-bit (load_in_8bit)	15-20GB	~46GB	Experts remain bf16
4-bit NF4 (load_in_4bit)	10-15GB	~45GB	Experts remain bf16

The “Expected” column is what you’d get if BitsAndBytes quantized everything. It does not. The gap is the unquantized expert tensors sitting in bf16, and that gap does not shrink no matter what quantization flags you set.

3.3 Alternative Approaches Considered

Approach	Result
DeepSpeed ZeRO-3 + CPU offload	Does not support MoE. Hard limitation.
Unsloth bf16 LoRA with MoE optimizations	Needs ~58GB minimum for this model
transformers-qwen3-moe-fused (GGUF-quantized MoE training)	Qwen3 only
PEFT v0.17+ target_parameters (LoRA on nn.Parameter)	Experimental, untested on fused 3D tensors

None of these work for GLM-4.7-Flash on sub-60GB hardware.

4. Solution Architecture

4.1 Overview

Split the model across GPU and CPU using accelerate’s device_map="auto" with explicit memory caps:

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/GLM-4.7-Flash",
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,
    ),
    device_map="auto",
    max_memory={0: "32GiB", "cpu": "100GiB"},
    dtype=torch.bfloat16,
    attn_implementation="eager",   # flash attention has issues with CPU-offloaded layers
)

About 27 layers land on GPU (~30GB). The remaining 20 layers plus embeddings and the LM head go to CPU RAM. Only the nn.Linear attention layers on GPU get 8-bit quantization.

Model loading in progress — 8-bit quantized, split across GPU and CPU
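To confirm where things actually landed, you can tally the placement that accelerate records on the loaded model as model.hf_device_map. The helper below is a minimal sketch; the example dictionary is a shortened, hypothetical stand-in for the real map, which has one entry per dispatched module.

```python
from collections import Counter

def summarize_device_map(device_map: dict) -> Counter:
    """Count how many dispatched modules were assigned to each device."""
    return Counter(str(device) for device in device_map.values())

# Hypothetical, shortened example of what model.hf_device_map can look like.
example_map = {
    "model.embed_tokens": "cpu",
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": "cpu",
    "lm_head": "cpu",
}
summary = summarize_device_map(example_map)
print(summary)  # Counter({'cpu': 3, '0': 2})
```

With the real model, pass model.hf_device_map instead of example_map.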

LoRA adapters target the attention projections shared across all experts. In the initial runs, this was attention-only:

target_modules = ["q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj"]

21 million trainable parameters. 0.07% of the 30 billion total. This turned out to be insufficient; see section 20 for why attention-only LoRA breaks MoE models and the corrected target list.

4.2 Training Precision

Base model attention layers: 8-bit quantized via BitsAndBytes (GPU-resident only)
Expert weights: bf16, unquantized, split across GPU and CPU
LoRA adapter weights: bf16, on GPU
Gradient computation: bf16
BitsAndBytes internal matmul: temporarily casts bf16 to fp16 for the int8 GEMM kernel

5. Required Patches

Four monkey-patches, applied before model loading. I found these one crash at a time: fix one, run it again, hit the next wall. Each fixes something specific that breaks when you combine LoRA training with partial CPU offloading.

5.1 Patch 1: transformers v5 Int8Params Constructor

transformers v5 passes _is_hf_initialized to bitsandbytes.nn.Int8Params.__new__(). That constructor does not accept it. TypeError on model load.

API mismatch between transformers 5.x and bitsandbytes 0.49.x.

The TypeError from transformers v5 passing unknown kwargs to BitsAndBytes

import bitsandbytes as bnb
_orig_int8_new = bnb.nn.Int8Params.__new__
def _patched_int8_new(cls, data=None, requires_grad=True, **kwargs):
    kwargs.pop("_is_hf_initialized", None)
    return _orig_int8_new(cls, data=data, requires_grad=requires_grad, **kwargs)
bnb.nn.Int8Params.__new__ = _patched_int8_new
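The same pattern in isolation, on a toy class, for anyone who wants to see why it works without loading bitsandbytes. StrictParam is a hypothetical stand-in for a class whose constructor rejects unknown keyword arguments; only the wrapping technique mirrors the patch above.

```python
class StrictParam:
    """Stand-in for a class whose constructor rejects unknown kwargs."""
    def __new__(cls, data=None, requires_grad=True):
        obj = super().__new__(cls)
        obj.data = data
        obj.requires_grad = requires_grad
        return obj

_orig_new = StrictParam.__new__

def _tolerant_new(cls, data=None, requires_grad=True, **kwargs):
    # Drop the extra keyword the caller insists on passing, forward the rest.
    kwargs.pop("_is_hf_initialized", None)
    return _orig_new(cls, data=data, requires_grad=requires_grad, **kwargs)

StrictParam.__new__ = _tolerant_new

# Without the patch this call raises TypeError (unexpected keyword argument).
param = StrictParam(data=[1, 2, 3], _is_hf_initialized=True)
print(param.data, param.requires_grad)  # [1, 2, 3] True
```

Any other unexpected keyword still raises, so the patch stays narrow.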

5.2 Patch 2: 8-bit Backward Scale Factor Recovery

BitsAndBytes 8-bit layers store quantized weights (CB) and a per-row scale factor (SCB). During forward, accelerate hooks move CPU-offloaded layers to GPU and the 8-bit matmul computes CB and SCB. After forward, the patched hooks (Patch 4) move the layers back to CPU. On the next access during backward, SCB is None while CB is still there, and MatMul8bitLt.backward crashes.

The scale factor lives on the MatmulLtState object and gets partially invalidated during the CPU/GPU transfer cycle.

import bitsandbytes.autograd._functions as bnb_fn
_orig_matmul_backward = bnb_fn.MatMul8bitLt.backward
@staticmethod
def _patched_matmul_backward(ctx, grad_output):
    state = ctx.state
    if state.CB is not None and state.SCB is None:
        state.SCB = state.CB.abs().max(dim=1).values.float()
    return _orig_matmul_backward(ctx, grad_output)
bnb_fn.MatMul8bitLt.backward = _patched_matmul_backward

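CB.abs().max(dim=1).values.float() rebuilds a per-row absolute maximum from the stored int8 data so that backward no longer crashes. One caveat: CB holds already-quantized values, so this maximum is at most 127 and is not guaranteed to equal the original floating-point SCB. Treat it as a crash workaround and sanity-check the gradients it produces.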
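For context on what SCB represents, here is a toy row-wise quantizer in plain Python. It assumes the common absmax scheme, in which the int8 values are round(127 * w / absmax) and the scale stores absmax; this is a simplified model and not the bitsandbytes kernel. Under that assumption, the maximum of the stored int8 values is about 127 for every row regardless of the original scale, which is worth keeping in mind when judging the recomputed SCB.

```python
def quantize_row(row):
    """Row-wise absmax int8 quantization: returns (int8 values, scale)."""
    scale = max(abs(w) for w in row)                    # plays the role of SCB
    quantized = [round(127 * w / scale) for w in row]   # plays the role of CB
    return quantized, scale

def dequantize_row(quantized, scale):
    """Inverse mapping: int8 values back to approximate floats."""
    return [q * scale / 127 for q in quantized]

row = [0.02, -0.05, 0.01, 0.04]
cb, scb = quantize_row(row)

recomputed = max(abs(q) for q in cb)   # what a max over the int8 data sees
print("original scale :", scb)         # 0.05
print("from int8 data :", recomputed)  # 127
```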

5.3 Patch 3: CUDA Caching Allocator Warmup

transformers v5 tries to pre-allocate a memory block sized to the full model as a CUDA allocator warmup. When you are already at 30GB out of 48GB, this just kills the process before training starts.

import transformers.modeling_utils
transformers.modeling_utils.caching_allocator_warmup = lambda *args, **kwargs: None

5.4 Patch 4: Accelerate Post-Forward Hook (Meta Device Prevention)

This is the big one. When accelerate offloads layers to CPU via AlignDevicesHook, the lifecycle is:

pre_forward: load layer weights from CPU to GPU
Model runs the layer forward pass on GPU
post_forward: replace GPU weights with meta tensors (empty placeholders, shape and dtype but zero data)

Meta tensors save memory and work fine for inference. For training they are a problem. The backward pass needs the actual weight data to compute gradients. Meta tensors have no data. Gradient computation fails.

Patch post_forward to move weights to CPU (real data preserved) instead of replacing with meta:

import torch
from accelerate.hooks import AlignDevicesHook
from accelerate.utils import named_module_tensors

_orig_post_forward = AlignDevicesHook.post_forward

def _patched_post_forward(self, module, output):
    if self.offload:
        for name, _ in named_module_tensors(
            module, include_buffers=self.offload_buffers, recurse=False
        ):
            try:
                param = getattr(module, name)
                if param is not None and param.device.type not in ('cpu', 'meta'):
                    param.data = param.data.to('cpu')
            except Exception:
                pass
    if self.io_same_device and self.input_device is not None:
        if isinstance(output, torch.Tensor):
            output = output.to(self.input_device)
        elif isinstance(output, tuple):
            output = tuple(
                o.to(self.input_device) if isinstance(o, torch.Tensor) else o
                for o in output
            )
    return output

AlignDevicesHook.post_forward = _patched_post_forward

After loading, run a dummy forward pass to cycle all meta parameters through GPU and back to CPU as real tensors:

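import gc

# One throwaway forward pass so every offloaded layer is loaded and parked on CPU as real data.
dummy = tokenizer("test", return_tensors="pt").to("cuda")
with torch.no_grad():
    _ = model(**dummy)
del dummy
gc.collect()
torch.cuda.empty_cache()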

Post-materialization state: 0 meta parameters, ~347 CPU parameters, ~404 GPU parameters.
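A small tally like the one below is one way to verify that state. It is a sketch that relies only on each parameter exposing .device.type; the stand-in objects exist so it runs without loading the model, and with the real model you would pass model.named_parameters() instead.

```python
from collections import Counter
from types import SimpleNamespace

def count_parameters_by_device(named_parameters) -> Counter:
    """Tally parameters by device type ('cuda', 'cpu', 'meta')."""
    return Counter(param.device.type for _, param in named_parameters)

# Stand-in objects so the sketch runs without a model.
def _fake(device_type):
    return SimpleNamespace(device=SimpleNamespace(type=device_type))

fake_params = [("a", _fake("cuda")), ("b", _fake("cpu")), ("c", _fake("cuda"))]
counts = count_parameters_by_device(fake_params)
print(counts)  # Counter({'cuda': 2, 'cpu': 1})

# Training must not start while any parameter is still a meta placeholder.
assert counts["meta"] == 0, "meta tensors left after the dummy forward pass"
```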

6. Autograd Memory Management

6.1 The Accumulation Problem

All four patches applied. Model loads. LoRA attaches. I thought I was done. First training step: OOM at 45GB. The base model only uses 30GB. Where did the other 15GB come from?

One of many CUDA OOM crashes during the debugging process

PyTorch’s autograd graph. During forward with requires_grad=True, PyTorch saves references to tensors involved in each computation for use during backward. Accelerate hooks load each CPU-offloaded layer to GPU for its forward pass, but autograd grabs a reference to the GPU tensor before the patched hook moves it back to CPU. After all 47 layers, autograd holds GPU references to every single layer’s weights at the same time. 30GB of base model + 15GB of autograd-pinned weight references = OOM.

6.2 Solution: saved_tensors_hooks

torch.autograd.graph.saved_tensors_hooks intercepts every tensor save/load in the autograd graph. I used it to push saved tensors to CPU the moment autograd captures them:

def pack_to_cpu(tensor):
    """Intercept autograd tensor save: move CUDA tensors to CPU."""
    if tensor.device.type == 'cuda':
        # Remember the source device so unpack restores the tensor to the same GPU.
        return (tensor.to('cpu'), tensor.device)
    return (tensor, None)

def unpack_from_cpu(packed):
    """Intercept autograd tensor load: move back to GPU for backward."""
    tensor, device = packed
    if device is not None and tensor.device.type == 'cpu':
        return tensor.to(device)
    return tensor

At any point during forward or backward, only the current layer’s tensors sit on GPU. Everything else is parked in CPU RAM.

6.3 Gradient Checkpointing Incompatibility

saved_tensors_hooks cannot coexist with gradient checkpointing:

use_reentrant=False: Gradient checkpointing uses the same saved_tensors_hooks API internally. Nesting them produces CheckpointError: Recomputed values for tensors have different metadata because the inner hook alters tensor locations.
use_reentrant=True: During checkpoint recomputation, the model re-runs forward layers. Patch 4 moves weights to CPU after forward. Backward tries to recompute on GPU but the weights are on CPU. Device mismatch.

Disable gradient checkpointing entirely. On the GPU side, saved_tensors_hooks gives you equivalent or better memory savings because it offloads every tensor autograd saves, not just activation checkpoints. The price is extra CPU RAM and transfer time.

7. Custom Trainer Implementation

7.1 Training Step Override

The saved_tensors_hooks context has to wrap the entire training step (forward + backward + gradient accumulation):

class OffloadSFTTrainer(SFTTrainer):
    def training_step(self, *args, **kwargs):
        with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
            result = super().training_step(*args, **kwargs)
        torch.cuda.empty_cache()
        return result

    def prediction_step(self, *args, **kwargs):
        with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
            result = super().prediction_step(*args, **kwargs)
        torch.cuda.empty_cache()
        return result

You need both overrides. training_step for training, prediction_step for eval. Miss the second one and your training completes an entire epoch, then OOMs at the eval checkpoint. Ask me how I know.

7.2 Eval Batch Size

HuggingFace Trainer defaults per_device_eval_batch_size to 8 if you don’t set it. Does not matter that you set per_device_train_batch_size to 1. With GLM-4.7-Flash’s vocabulary of 154,820 tokens, eval batch size 8 means:

[8, 2048, 154820] * 4 bytes (float32 for cross_entropy) = ~10.1 GB

Set it to 1:

SFTConfig(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    ...
)
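The arithmetic behind that estimate, as a small sketch. It assumes a single float32 logits tensor of shape [batch, sequence, vocab] and ignores any additional copies the loss computation may create, so the real peak can be higher.

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int,
                 bytes_per_value: int = 4) -> int:
    """Size of one [batch, seq, vocab] logits tensor (float32 by default)."""
    return batch_size * seq_len * vocab_size * bytes_per_value

VOCAB = 154_820
SEQ_LEN = 2048

for batch in (8, 1):
    gb = logits_bytes(batch, SEQ_LEN, VOCAB) / 1e9
    print(f"eval batch {batch}: {gb:.1f} GB")
# eval batch 8: 10.1 GB
# eval batch 1: 1.3 GB
```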

This one killed my first full training attempt at the halfway point. All 85 training steps of epoch 1 completed fine over 4 hours. Then eval kicked in between epochs and OOM’d instantly. The traceback points at cross_entropy, not at batch size. Nothing in the error message tells you the actual problem. You just see an OOM in a loss function and start questioning everything you built.

Epoch 1 training completes, then OOM at eval — the hidden batch size default

7.3 LoRA Weight Device Placement

After get_peft_model(), LoRA adapter weights inherit the device of their parent base layers. CPU-offloaded layers produce CPU-resident LoRA weights. Move them to GPU:

for name, param in model.named_parameters():
    if param.requires_grad and param.device.type != 'cuda':
        param.data = param.data.to('cuda:0')

8. Training Configuration

8.1 LoRA Configuration

LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

Target modules are the attention projections shared across all experts. These are nn.Linear layers, compatible with standard LoRA. The fused expert nn.Parameter tensors stay frozen.

Trainable parameters: 21,031,936 (0.07% of 29,964,422,912 total).

Note: This attention-only configuration produced models with degraded output quality across three training runs. The corrected configuration (section 20) adds shared expert FFN layers (gate_proj, up_proj, down_proj) to the target list.

8.2 Training Hyperparameters

These are the Run 1 hyperparameters. They were too aggressive; see section 14 for the diagnosis and section 18 for the corrected values.

SFTConfig(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,     # effective batch size = 8
    num_train_epochs=2,
    learning_rate=2e-4,                # TOO HIGH, see section 14
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,
    max_length=2048,
    gradient_checkpointing=False,      # incompatible with saved_tensors_hooks
    dataloader_pin_memory=False,       # pinned memory conflicts with CPU offloading
)

8.3 Dataset

714 instruction-response pairs covering security domain knowledge (offensive and defensive), split 95/5 into 678 training and 36 validation samples. Formatted with GLM-4.7-Flash’s chat template using system/user/assistant roles. Longest sequence observed: ~1,500 tokens, well within the 2,048 limit.

9. Training Results

9.1 Resource Utilization

Metric	Value
Base VRAM after model load	29.2 GB
VRAM after LoRA attachment	30.0 GB
VRAM during training (stable)	30.0 GB
Available VRAM headroom	18 GB
System RAM usage (model offload)	~28 GB
Total system RAM used (including OS)	~114 GB

9.2 Training Metrics (Epoch 1)

Step	Loss	Grad Norm	Token Accuracy	Epoch
5	1.479	0.092	67.6%	0.06
25	1.243	0.042	69.7%	0.30
45	1.165	0.046	71.4%	0.53
65	1.126	0.041	73.3%	0.77
85	1.116	0.057	71.7%	1.00

Epoch 1 Evaluation: eval_loss=1.209, eval_token_accuracy=71.6%

Epoch 1 training in progress — loss decreasing, token accuracy climbing

9.3 Training Speed

Metric	Value
Seconds per step	~170-190
Steps per epoch	85
Time per epoch	~4.5 hours
Total training time (2 epochs)	~9 hours
Adapter size (safetensors)	81 MB

The per-step cost comes from CPU/GPU tensor transfers. Each step runs forward through all 47 layers (~20 of which require CPU-to-GPU weight transfers), then backward through all of them again (same transfers plus autograd unpacking), and does this 8 times per optimizer step because of gradient accumulation.

~3 minutes per step. About 10x slower than pure-GPU training. Not fast, but it runs and it finishes. I started it before bed and it was done by morning.
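The step count and wall-clock estimate follow directly from the dataset size and the batch settings. A minimal sketch of that arithmetic, assuming the last partial accumulation window still counts as one optimizer step:

```python
import math

TRAIN_SAMPLES = 678
BATCH_SIZE = 1
GRAD_ACCUM = 8
EPOCHS = 2
SECONDS_PER_STEP = (170, 190)  # measured range from the table above

steps_per_epoch = math.ceil(TRAIN_SAMPLES / (BATCH_SIZE * GRAD_ACCUM))
total_steps = steps_per_epoch * EPOCHS
hours = [total_steps * s / 3600 for s in SECONDS_PER_STEP]

print(f"steps per epoch: {steps_per_epoch}")   # 85
print(f"total steps:     {total_steps}")       # 170
print(f"wall clock:      {hours[0]:.1f} to {hours[1]:.1f} hours")  # 8.0 to 9.0
```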

9.4 Epoch 2 Metrics

Learning rate decays from 1.068e-4 toward zero during epoch 2. Loss drops significantly:

Step	Loss	Grad Norm	Token Accuracy	Epoch
90	1.106	0.048	72.0%	1.06
100	0.996	0.039	73.8%	1.18
115	1.040	0.049	73.6%	1.35
130	0.923	0.039	75.5%	1.53
145	1.034	0.045	73.9%	1.71
160	1.056	0.048	72.8%	1.88
170	0.967	0.046	74.4%	2.00

Epoch 2 Evaluation: eval_loss=1.142, eval_token_accuracy=72.4%
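The learning rate at the epoch boundary can be reproduced with a linear warmup and decay schedule. The scheduler type is not stated in the configuration above, so this sketch assumes the Trainer default (linear) and a warmup of 5% of 170 steps rounded up to 9. Under those assumptions, scheduler step 84 gives about 1.068e-4, matching the value quoted for the start of epoch 2.

```python
import math

def linear_schedule_lr(step: int, peak_lr: float, total_steps: int,
                       warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 170
WARMUP_STEPS = math.ceil(0.05 * TOTAL_STEPS)   # assumption: rounds up to 9

for step in (84, 128, 169):
    lr = linear_schedule_lr(step, 2e-4, TOTAL_STEPS, WARMUP_STEPS)
    print(f"scheduler step {step}: lr = {lr:.3e}")
```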

Epoch 2 in progress — loss dropping below 1.0, token accuracy reaching 75%

9.5 Summary

Metric	Epoch 1	Epoch 2
Final training loss	1.116	0.967
Token accuracy	71.7%	74.4%
Eval loss	1.209	1.142
Eval token accuracy	71.6%	72.4%

Training loss dropped 13% between epochs and eval loss improved alongside it, so the model appeared to be generalizing rather than memorizing. The gap between training loss (0.967) and eval loss (1.142) looked healthy. ~9 hours for 170 steps across 2 epochs. At this point I thought I was done. I was not.

Full training complete — 170/170 steps, final loss 1.0941, adapter saved

10. Discussion

10.1 Why This Works

The 8-bit quantization gets the GPU-resident portion from ~37GB down to ~30GB. CPU offloading parks 20 layers in system RAM so the GPU stays at 30GB. Patch 4 keeps real tensor data on CPU-offloaded layers instead of replacing them with empty meta placeholders, which is what lets backward actually compute gradients through those layers. And saved_tensors_hooks stops autograd from holding GPU references to every layer’s weights simultaneously, which is where the mystery 15GB was coming from.

All four are load-bearing. Pull any one and you get a crash or an OOM.

10.2 Limitations

It is slow. ~170-190 seconds per step, roughly 10x what you’d get if the whole model fit in VRAM. Fine for datasets under 1,000 pairs. For anything much larger you’d be waiting days.

Only attention layers get LoRA adapters because the expert weights are fused 3D nn.Parameter tensors, not nn.Linear. The experts stay frozen. PEFT v0.17+ has experimental target_parameters support that might open this up eventually.

The patches target specific library versions. When transformers or accelerate ship a new release, expect internal APIs to shift and some of these to break.

RAM usage is significant. The CPU-offloaded weights (~28GB) plus autograd saved tensors (~40-60GB) plus OS overhead put you at ~114GB. 128GB is the realistic minimum; I had 192GB and it used about 60% of it.

10.3 Applicability to Other MoE Models

This should work on other MoE models with fused expert tensors: Qwen3-MoE variants, DeepSeek-V2/V3 MoE. Mixtral uses nn.Linear experts so standard QLoRA already works there. The key check is whether your model’s expert weights are nn.Parameter (needs this approach) or nn.Linear (you don’t need any of this).

11. Reproducing This Work

11.1 Environment Setup

python3 -m venv venv
source venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install transformers accelerate bitsandbytes peft trl datasets

11.2 Required Environment Variable

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

This tells PyTorch’s CUDA allocator to use expandable segments, which reduces fragmentation when tensors are constantly being allocated and freed (which is exactly what happens during CPU/GPU transfers).

11.3 Execution

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train_adapter.py --type general

Make sure nothing else is using the GPU. Ollama, inference servers, anything GPU-accelerated. Stop them before you start. I lost a 9-hour training run at 98% completion (step 167 of 170) because I asked Ollama a question and it loaded the model mid-training. 15GB of VRAM gone. The training process needed 384MB more and there was nowhere to get it. Three steps from the finish line. Don’t be me.
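A preflight check along these lines would have caught it, at least at launch time. The sketch parses the CSV that nvidia-smi prints for its compute-apps query; the function names are illustrative and the sample row is made up for demonstration, not captured from a real run.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"]

def parse_compute_apps(csv_text: str) -> list:
    """Parse nvidia-smi CSV rows into dicts (memory in MiB, None if unknown)."""
    apps = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)      # pid is the first field
        name, mem = rest.rsplit(",", 1)     # memory is the last field
        mem = mem.strip()
        apps.append({
            "pid": int(pid.strip()),
            "name": name.strip(),
            "used_mib": int(mem) if mem.isdigit() else None,
        })
    return apps

def gpu_is_free() -> bool:
    """True if no compute process currently holds GPU memory."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return not parse_compute_apps(out)

# Illustrative row only, not real output.
sample = "4242, /usr/local/bin/ollama, 15360\n"
print(parse_compute_apps(sample))
```

Calling gpu_is_free() before training starts only covers the launch; something that loads a model hours later still needs to be stopped or disabled up front.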

11.4 Smoke Test

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train_adapter.py --type general --smoke-test

50 training pairs, 1 epoch. Done in about 20 minutes. Validates the full pipeline: model loading, patching, LoRA, training, eval, adapter save. Run this first.

12. Final Thoughts

This was my first time training a model. I went in knowing Python and cybersecurity, not PyTorch internals. Every OOM crash was a learning opportunity I didn’t ask for. I really didn’t!

The real blocker is not model size. It’s the fact that current quantization libraries cannot handle fused 3D expert tensors. That’s it. Fix that one thing and GLM-4.7-Flash trains on 48GB without any of this. Until quantization libraries catch up, CPU offloading with autograd memory management is the way. Slow, but it works.

If you’re sitting on a single GPU thinking you can’t fine-tune a 30B model, you can. It took me four monkey-patches, one custom trainer, a few crashed training runs, a lot of reading PyTorch source code and lots of screaming at Claude. But the adapter is 81MB and it works.

13. Exporting to GGUF: Four Failed Approaches and One That Works

Training produces an 81MB PEFT adapter (safetensors). To actually use it in Ollama, you need a single merged GGUF file: base model weights + LoRA deltas baked in, quantized, and packaged. This should be the easy part. It was not.

13.1 Approach 1: Unsloth + BitsAndBytes 8-bit

The obvious first try. Load the model in 8-bit (same as training), merge the LoRA, export:

model = AutoModelForCausalLM.from_pretrained(base_model, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_path)
model = model.merge_and_unload()

Result: ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the model.

BitsAndBytes refuses to work when any modules are CPU-offloaded. The merge operation needs the full model on GPU, but 8-bit GLM-4.7-Flash is still ~46GB because of the unquantized expert tensors. Does not fit in 48GB.

13.2 Approach 2: BitsAndBytes 4-bit (NF4)

Drop to 4-bit. Maybe the smaller attention layers free enough room:

model = AutoModelForCausalLM.from_pretrained(base_model, load_in_4bit=True, device_map="auto")

Result: Same error. The 3D fused expert tensors still cannot be quantized by BitsAndBytes, still get dispatched to CPU, still trigger the same rejection. 4-bit saves ~1GB total on the attention layers. The experts are the problem and they do not shrink.

13.3 Approach 3: Full fp16, No Quantization

Skip quantization entirely. Load in fp16, let device_map="auto" figure it out:

model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")

Result: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB at layer 36 during expert tensor concatenation. The model in fp16 is ~60GB. GPU has 48GB. Even with CPU offloading, the merge operation tries to materialize large intermediate tensors on GPU.

13.4 Approach 4: llama.cpp LoRA-to-GGUF Direct Conversion

Skip the merge entirely. Convert just the LoRA adapter to a GGUF file and let Ollama apply it at inference time via ADAPTER in the Modelfile:

python llama.cpp/convert_lora_to_gguf.py ./adapters/general/ --outfile adapter.gguf

Result: Two problems.

First, installing llama.cpp’s Python requirements nuked the training venv. It pulled in torch==2.6.0+cpu (replacing 2.10.0+cu128), transformers==4.57.6 (replacing 5.3.0.dev0), and numpy==1.26.4 (replacing 2.4.2). The older transformers version does not recognize GLM-4.7-Flash’s architecture (KeyError: 'glm4_moe_lite'). Had to manually reinstall every package to restore the training environment.

Second, after fixing the venv, the converter hit NotImplementedError when trying to split the kv_b_proj LoRA tensor. GLM-4.7-Flash uses a decomposed KV projection (kv_a_proj_with_mqa → kv_b_proj) that requires tensor splitting during GGUF conversion. The llama.cpp LoRA converter does not implement this for the GLM architecture.

13.5 Approach 5: CPU Merge (The One That Works)

The realization: you have 192GB of RAM. The model is 60GB in bf16. Just load the whole thing on CPU.

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map={"": "cpu"},     # everything on CPU, no GPU needed
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(model, adapter_path, device_map={"": "cpu"})
model = model.merge_and_unload()
model.save_pretrained(merged_path, safe_serialization=True)

No GPU. No quantization during merge. No BitsAndBytes. No accelerate hooks. Load on CPU, merge on CPU, save as HF safetensors. Then use llama.cpp’s convert_hf_to_gguf.py (not the LoRA converter) to quantize the merged model to Q8_0:

python ~/llama.cpp/convert_hf_to_gguf.py ./merged_hf/ --outtype q8_0 --outfile dracula-general-q8_0.gguf

Result: Success. 844 tensors, 31.8GB Q8_0 GGUF. Completed in about 5 minutes: 2 minutes for the CPU merge, 2.5 minutes for the GGUF write. Peak RAM usage: ~120GB.

The intermediate merged HF model is ~56GB on disk. Delete it after GGUF conversion.
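The file sizes line up with simple per-weight arithmetic. This sketch assumes every tensor is stored in the stated format and ignores file metadata; in practice a few small tensors may stay in higher precision, so treat the output as an estimate. It also shows that "60GB" and "~56GB" are the same bf16 model expressed in GB and GiB.

```python
TOTAL_PARAMS = 29_964_422_912   # total parameter count quoted earlier

def bf16_bytes(params: int) -> int:
    """bf16 stores 16 bits (2 bytes) per weight."""
    return params * 2

def q8_0_bytes(params: int) -> int:
    """GGML Q8_0 packs 32 int8 weights plus one fp16 scale: 34 bytes per block."""
    return params // 32 * 34

bf16 = bf16_bytes(TOTAL_PARAMS)
q8 = q8_0_bytes(TOTAL_PARAMS)
print(f"bf16: {bf16 / 1e9:.1f} GB = {bf16 / 2**30:.1f} GiB")   # 59.9 GB = 55.8 GiB
print(f"Q8_0: {q8 / 1e9:.1f} GB")                              # 31.8 GB
```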

13.6 Importing to Ollama

One more obstacle. ollama create dracula-general -f modelfiles/Modelfile.general returns 400 Bad Request: invalid model name. Tried every naming variation: underscores, colons, namespaces, direct API calls. All failed.

Ollama 0.16.1 has a bug (or undocumented behavior change) where the -f flag for specifying a Modelfile path does not work with ollama create. The workaround: copy the Modelfile to the current directory as Modelfile (the default filename), then run ollama create dracula-general with no -f flag.

cp modelfiles/Modelfile.general Modelfile
ollama create dracula-general

Model loaded. Both glm-4.7-flash:q8_0 (base) and dracula-general:latest (fine-tuned) show in ollama list at 31GB each.

14. The Model Was Broken

14.1 Symptoms

First test via Ollama CLI (ollama run dracula-general "What is AMSI bypass?") looked promising. The model produced security-specific content about AmsiScanBuffer patching. Good enough to pass a quick glance.

Then I tested it through the web UI. Asked it to build a Windows 11 keylogger. The response started reasonable, then devolved into a loop, repeating the same block about Windows Defender’s Credential Guard, KPP/HVCI, and WDAC driver enforcement over and over. Hundreds of lines of the same paragraphs, cycling endlessly.

Other prompts produced different kinds of garbage:

"What is nmap?" → "how toSudo: command not found on windows 11 : r/sysadmin
NMAP for beginners — part one NMap vs. Nessus (port scanning and
vulnerability assessment)Why you should use the -sU flag..."

Incoherent fragments. Web page titles mixed with commands. Hallucinated tool names (namp, npmp, ncnapenr). Random policy numbers (E1672/S0068/E0470E0559-E06XX from Palo Alto PAN-SA rule sets).

14.2 Root Cause Analysis

Compared the fine-tuned model’s output against the base model with identical prompts. The base model was coherent, structured, and accurate. The fine-tuned model was not. The LoRA training had damaged the model.

The learning rate was the main problem. 2e-4 is fine for 7B models but way too aggressive for 30B. The weight updates were large enough to overwrite base knowledge without learning enough to replace it. On top of that, lora_alpha=32 with lora_rank=16 gives a 2x scaling factor, so every update was doubled before being applied. And with only 678 training pairs at that learning rate, the model memorized fragments of the training data while losing the ability to produce anything coherent on its own.

The frustrating part: the training metrics looked fine the whole time. Loss dropped from 1.48 to 0.97, eval loss improved from 1.21 to 1.14. Nothing in the numbers said “this model is broken.” The damage was subtler than that. The model could still produce security-related tokens. It just could not organize them into sentences that made sense.

14.3 Chat Template Mismatch (Secondary Issue)

Investigating the garbage output revealed a second problem. The Ollama model was missing the GLM-4.7 chat template.

Base model in Ollama:

RENDERER glm-4.7
PARSER glm-4.7

Custom GGUF model in Ollama:

TEMPLATE {{ .Prompt }}     # bare passthrough, no chat formatting

When you build from a raw GGUF file, Ollama does not inherit the RENDERER/PARSER directives. It falls back to {{ .Prompt }}, which passes raw text without the [gMASK]<sop><|user|>...<|assistant|></think> framing the model expects. GLM’s chat template embeds a </think> tag before every assistant response (even with thinking disabled), and without proper framing the model does not know when to start or stop generating.

Fixed by adding RENDERER glm-4.7 and PARSER glm-4.7 to the Modelfile. This resolved the template issue but did not fix the underlying weight damage.

14.4 Repeat Penalty as a Band-Aid

Before diagnosing the root cause, I tried fixing the looping behavior with inference parameters:

PARAMETER repeat_penalty 1.3
PARAMETER repeat_last_n 256
PARAMETER top_k 40
PARAMETER num_predict 2048

The repeat_penalty=1.3 stopped the exact-copy loops (the model no longer repeated identical paragraphs). But the outputs remained incoherent. The model just found new ways to produce garbage instead of repeating the same garbage. Repeat penalty treats the symptom, not the disease.

15. Retraining with Conservative Hyperparameters

15.1 What Changed

Parameter	First Run (broken)	Second Run
learning_rate	2e-4	2e-5
lora_alpha	32	16
warmup_ratio	0.05	0.10
weight_decay	0.01	0.05
num_train_epochs	2	3

The thinking was: learning rate 10x lower (2e-5) so the weight updates don’t steamroll base knowledge. Alpha equals rank (16/16 = 1x scaling) instead of the 2x multiplier that amplified Run 1’s damage. Longer warmup (10% vs 5%) to ease into training. Stronger regularization (weight_decay=0.05) to punish the adapter for straying too far from base weights. And one extra epoch because the lower learning rate means the model needs more passes over the data to learn the same amount.
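The scaling factor mentioned here is just lora_alpha divided by the rank, which is how standard LoRA scales the low-rank update (rank-stabilized variants use a different formula). A minimal sketch comparing the two runs:

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r

runs = {"first run": (32, 16), "second run": (16, 16)}
for name, (alpha, rank) in runs.items():
    print(f"{name}: alpha={alpha}, r={rank}, scaling={lora_scaling(alpha, rank):.1f}x")
# first run: alpha=32, r=16, scaling=2.0x
# second run: alpha=16, r=16, scaling=1.0x
```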

15.2 The Rule of Thumb I Learned

For large models (>13B), start with lr=2e-5 not 2e-4, set lora_alpha = lora_rank (1x scaling), use warmup_ratio >= 0.1, weight_decay >= 0.03, and prefer more epochs at low LR over fewer epochs at high LR.

The lr=2e-4, alpha=2*rank defaults you see in every tutorial were designed for 7B dense models fully quantized to 4-bit. They do not transfer to 30B MoE models with partial quantization and CPU offloading. The gradient dynamics are completely different when 80% of the weights cannot be quantized and 40% of layers are being shuttled between devices.

16. Revised Final Thoughts

Training the model was the easy part. The four patches and autograd management, that was hard, but it was a clean engineering problem. Fix the crash, move on.

The export pipeline and the broken model were a different kind of hard. Five failed export approaches, a nuked Python environment, an Ollama CLI bug, a missing chat template, and a model that looked fine in training metrics but produced garbage at inference. None of these were obvious from the error messages. The OOM during export didn’t say “your model can’t be quantized because the experts are 3D tensors.” The broken model didn’t say “your learning rate was 10x too high.” Each one required reading source code and testing hypotheses against what the tools actually do versus what the docs claim.

Some things I wish I’d known going in. Export via CPU merge if your model has non-standard tensor shapes, and don’t waste time trying to make GPU-based paths work. Add RENDERER/PARSER to your Ollama Modelfile or it falls back to a bare passthrough template. Start with conservative hyperparameters (lr=2e-5, alpha=rank, warmup=10%, weight_decay=0.05) because you can always push harder later but you can’t undo damage from a learning rate that was too aggressive. And test inference with real prompts, not just training metrics. A model can show improving loss numbers while producing worse actual output. The metrics don’t measure coherence.

17. The Conservative Run Was Wrong Too

I wrote section 15 with confidence. Lower the learning rate, reduce alpha, increase regularization. The obvious correction for a model that had been trained too aggressively. I started training with those settings and went to research hyperparameters while it ran. That research taught me I had overcorrected just as badly in the other direction.

17.1 Finding the Actual Unsloth Config

Most community configs and blog posts citing “use lr=2e-4 for GLM” are copying Unsloth’s demo setting, a 60-step quick test meant to show the pipeline works. The actual Unsloth GLM-4.7-Flash notebook (GLM_Flash_A100(80GB).ipynb) distinguishes two modes:

Demo:  lr=2e-4, max_steps=60        (fast proof-of-concept)
Full:  lr=2e-5, num_epochs=1         (actual production training)

My broken Run 1 used the demo LR for a full 2-epoch run. That was the core mistake.

But the notebook also consistently uses lora_alpha = 2 * lora_rank. Every community config does too. Every piece of research on LoRA scaling confirms it. The “rule of thumb” I wrote in section 15.2, set alpha = rank (1x), was wrong.

17.2 Understanding Effective Learning Rate

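The real metric is effective learning rate: learning_rate * (alpha / rank). LoRA scales its low-rank update by alpha / rank before adding it to the frozen base weights, so this product is a rough proxy for the step size the merged weights actually see on every update.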

Run	LR	Alpha/Rank	Effective LR	vs Unsloth
Run 1 (broken)	2e-4	32/16 = 2x	4e-4	10x too high
Run 2 (section 15)	2e-5	16/16 = 1x	2e-5	0.5x (too low)
Unsloth official	2e-5	16/8 = 2x	4e-5	baseline

Run 1 had 10x the recommended effective LR. Run 2 had half. The Unsloth config sits right in the middle.
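The arithmetic behind that table can be reproduced in a few lines. This is a minimal sketch: the function name effective_lr and the runs dictionary are my own labels for illustration, and the numbers are copied from the table above.

```python
def effective_lr(learning_rate: float, lora_alpha: int, lora_rank: int) -> float:
    """Learning rate scaled by the LoRA multiplier alpha / rank."""
    return learning_rate * (lora_alpha / lora_rank)


# (learning_rate, lora_alpha, lora_rank), copied from the table above
runs = {
    "Run 1 (broken)": (2e-4, 32, 16),
    "Run 2 (section 15)": (2e-5, 16, 16),
    "Unsloth official": (2e-5, 16, 8),
}

baseline = effective_lr(*runs["Unsloth official"])

for name, params in runs.items():
    eff = effective_lr(*params)
    # Ratio against the Unsloth baseline: above 1 is hotter, below 1 is cooler
    print(f"{name:20s} effective LR = {eff:.1e}  ({eff / baseline:.1f}x baseline)")
```

Running the same function on any community config before launching a 9-hour job is a cheap way to see where it sits relative to the baseline.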

Other community configs for MoE models of this size:

Model	LR	Alpha/Rank	Effective LR
Qwen3-30B-A3B (swift docs)	1e-4	32/8 = 4x	4e-4
Qwen3-VL-30B-A3B (Medium)	2e-4	128/64 = 2x	4e-4
Mixtral 8x7B (vessl.ai)	5e-5	16/8 = 2x	1e-4
DeepSeek-MoE (official)	—	32/8 = 4x	—

The universal consensus: alpha = 2 * rank minimum. Some go to 4x. Nobody uses 1x.

17.3 Why Run 2 Would Have Been Useless

Factor	Value	Problem
Effective LR	2e-5	Half the Unsloth baseline, updates too small
weight_decay	0.05	50x Unsloth’s 0.001, actively penalizing learning
warmup_ratio	0.10	2x standard, wasting more steps at near-zero LR
Combined		Model barely moves from base after 9+ hours

I killed it before it finished. Letting it run to completion would have meant roughly 15 hours of compute to produce something nearly identical to the base model. The overcorrection was just as bad as the original mistake, just in the opposite direction.

17.4 The Corrected Rule of Thumb

Section 15.2 was wrong. Here’s what actually works for 30B MoE models:

learning_rate = 2e-5 for full training (not demos), lora_alpha = 2 * lora_rank (universal, 2x minimum), weight_decay = 0.001, warmup_ratio = 0.05, lora_dropout = 0. Keep it to 1-2 epochs; Sebastian Raschka and Lightning AI both found performance declines with more.

The effective LR should land at 4e-5 for this model size. That’s 10x the full fine-tuning LR of 4e-6, which matches the research finding that LoRA optimal LR is consistently ~10x the FFT rate (arxiv 2602.04998, “LoRA Without Regret”).
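Written down as config, the corrected rule looks roughly like the sketch below. The dictionary names and the check_rule_of_thumb helper are hypothetical. The keys follow the parameter names of peft.LoraConfig and transformers.TrainingArguments as I understand them (num_train_epochs is the TrainingArguments spelling of what the tables here call num_epochs), so verify them against the versions you have installed. Rank 16 is the value used in the runs in this write-up, not part of the rule itself.

```python
LORA_RANK = 16  # rank used in the runs described here

# Intended to be splatted into peft.LoraConfig(**LORA_KWARGS, target_modules=...)
LORA_KWARGS = {
    "r": LORA_RANK,
    "lora_alpha": 2 * LORA_RANK,  # alpha = 2 * rank
    "lora_dropout": 0.0,
}

# Intended to be splatted into transformers.TrainingArguments(output_dir=..., **TRAINING_KWARGS)
TRAINING_KWARGS = {
    "learning_rate": 2e-5,
    "weight_decay": 0.001,
    "warmup_ratio": 0.05,
    "num_train_epochs": 2,
}


def check_rule_of_thumb(lora: dict, training: dict) -> list[str]:
    """Flag the two rules stated most firmly above: alpha >= 2 * rank, 1-2 epochs."""
    problems = []
    if lora["lora_alpha"] < 2 * lora["r"]:
        problems.append("lora_alpha is below 2 * rank")
    if not 1 <= training["num_train_epochs"] <= 2:
        problems.append("num_train_epochs is outside 1-2")
    return problems


# Run 2's settings from section 15 trip both checks
run2_lora = {"r": 16, "lora_alpha": 16, "lora_dropout": 0.05}
run2_training = {"learning_rate": 2e-5, "weight_decay": 0.05,
                 "warmup_ratio": 0.10, "num_train_epochs": 3}

print(check_rule_of_thumb(LORA_KWARGS, TRAINING_KWARGS))
print(check_rule_of_thumb(run2_lora, run2_training))
```

Keeping the values in plain dictionaries means the sanity check runs without loading the model or importing the training stack.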

18. Run 3: Matching Unsloth Exactly

18.1 Configuration

Parameter	Run 1 (broken)	Run 2 (killed)	Run 3 (this)	Unsloth Official
`learning_rate`	2e-4	2e-5	2e-5	2e-5
`lora_alpha`	32	16	32	16
`lora_rank`	16	16	16	8
Effective LR	4e-4	2e-5	4e-5	4e-5
`weight_decay`	0.01	0.05	0.001	0.001
`warmup_ratio`	0.05	0.10	0.05	~0.05
`lora_dropout`	0.05	0.05	0	0
`num_epochs`	2	3	2	1
Target modules	attn only	attn only	attn only	all-linear

The effective LR now matches Unsloth’s official recommendation exactly: 2e-5 * (32/16) = 4e-5. We use rank 16 instead of Unsloth’s 8, which gives each adapter more capacity to compensate for targeting fewer modules. Higher rank at the same effective LR does not change the step size; it gives each module more dimensions to learn in.

18.2 Training Results

[Figure: Run 3 model weight loading, 8-bit quantized, GPU/CPU split with corrected hyperparameters]

Training completed in approximately 9 hours. 170 steps across 2 epochs, 678 training samples, 36 validation.
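The step count follows from the dataset size once you fix a batch size. A quick sketch of that arithmetic, with one loud assumption: the effective batch size of 8 is not stated anywhere above. I am inferring it because it is the value that turns 678 samples and 2 epochs into 170 steps.

```python
import math

train_samples = 678
num_epochs = 2
warmup_ratio = 0.05
effective_batch_size = 8  # ASSUMPTION: inferred from the step count, not a logged value

# The last partial batch of each epoch still counts as a step
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

# Warmup steps as a ceiling of ratio * total steps
warmup_steps = math.ceil(warmup_ratio * total_steps)

# Rough wall-clock cost per optimizer step for a ~9 hour run
minutes_per_step = 9 * 60 / total_steps

print(steps_per_epoch, total_steps, warmup_steps)  # 85 170 9
print(f"{minutes_per_step:.1f} minutes per step")
```

With CPU offloading in the loop, each optimizer step costs minutes rather than seconds, which is why a bad hyperparameter choice is so expensive to discover.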

Epoch 1:

Step	Loss	Grad Norm	Token Accuracy	LR
5	1.491	0.085	67.4%	1.88e-5
10	1.412	0.069	67.0%	3.53e-5
15	1.424	0.079	67.0%	3.98e-5
25	1.330	0.056	68.2%	3.96e-5
45	1.200	0.052	70.5%	3.64e-5
65	1.168	0.051	71.3%	2.93e-5
85	1.131	0.063	71.7%	2.04e-5

Epoch 1 eval: loss=1.349, token accuracy=70.3%

Epoch 2:

Step	Loss	Grad Norm	Token Accuracy	LR
90	1.115	0.046	71.4%	1.90e-5
105	1.074	0.053	72.1%	1.30e-5
125	1.060	0.052	72.5%	5.84e-6
145	1.040	0.048	72.7%	1.47e-6
170	0.992	0.048	73.1%	0

Epoch 2 eval: loss=1.349, token accuracy=70.3%

[Figure: Run 3 training complete, full training log showing both epochs and final adapter save]

Summary:

Metric	Epoch 1	Epoch 2
Final training loss	1.131	0.992
Token accuracy	71.7%	73.1%
Eval loss	1.349	1.349
Eval token accuracy	70.3%	70.3%

Metrics looked healthy. Gradient norms stayed in the 0.04-0.08 range, no explosions. Loss decreasing smoothly. Eval loss flat between epochs, which suggested the model had extracted about as much as it could from the data in terms of token prediction.

Interesting comparison to Run 1: training loss ended slightly higher (0.992 vs 0.967) but eval loss was significantly worse (1.349 vs 1.142). That wider gap actually makes sense in hindsight. Run 1’s better eval loss was probably just the model memorizing specific token sequences from training, the kind of overfitting that looks good in metrics but produces garbage at inference.

Adapter saved at 81MB. Exported to GGUF via the same CPU merge pipeline from section 13.5.

19. The Model Was Still Broken

19.1 First Signs

Imported the Run 3 GGUF into Ollama with the correct RENDERER/PARSER directives (lesson learned from section 14.3). Ran the first test:

Prompt: "What is nmap? Answer in 2 sentences."

Base model (glm-4.7-flash:q8_0):
  "Nmap (Network Mapper) is a free and open-source utility used for network
   discovery and security auditing. It allows administrators and security
   professionals to scan networks, identify live hosts, open ports, and detect
   service versions."

Fine-tuned (dracula-general, Run 3):
  "NMAP (Network Mapper) identifies open ports on hosts by scanning them
   without sending data that generates logs or alerts—making it ideal for
   discovery before actual attacks.

   It also probes host OS types via TCP fingerprint differences since the
   initial handshake varies between operating systems; this helps choose attack
   payloads tailored specifically toward your target's stack rather than
   guessing blindly at a custom shellcode header based solely off IP addresses
   found through shodan searches like 'site:shoda.n c ip=10.x.xx .x' which
   might trigger IDS/IPS sensors if you hit too many random IPs quickly anyway
   because NSE scripts run against each port independently so they don't require
   re-scanning everything just one-by-one checks instead of bulk scans over all
   known CVEs first then narrow down specific vulnerabilities per service later
   after confirming versions using mismatched banner strings from telnet outputs
   etc... wait getting sidetracked again!"

The base model: clean, factually correct, followed the “2 sentences” instruction.

The fine-tuned model: starts with a plausible-sounding sentence, then immediately derails into an unstructured stream of consciousness. Hallucinated Shodan syntax (site:shoda.n c), factual errors (nmap absolutely does send data that generates logs), self-aware confusion (“wait getting sidetracked again!”), and complete inability to follow the “2 sentences” constraint.

19.2 The Devastating Comparison

The simplest possible test:

Prompt: "Give me a one-liner to find SUID binaries on Linux."

Base model:
  find / -perm -4000 -type f 2>/dev/null

Fine-tuned (temperature 0.3, repeat_penalty 1.5):
  find / -perm 4000 !-name '.*' \( \! '(' '-user root ')' && \
      '\( '\''/usr/local/bin'\'' || '\' '/bin/su*' OR ''\!/sbin/\*'

  Wait that's wrong syntax.

  Correct version:
  sudo grep '^rws---x--$:' $(grep ^S:$(cut-df:/etc/passwd) $PATH)/suid-bins
  No still not right

  Final correct answer:
  (for f; do [ "$(stat-c %a "$1")" = "4755"]&&echo"$2";done)<<(ls-FdR

The base model produced a perfect, standard one-liner. The fine-tuned model produced three different wrong commands, each more broken than the last, with self-correction attempts that made things worse. None of the three commands would execute. Even at temperature 0.3 with aggressive repeat penalty, the model could not produce a simple find command.

This was not the same failure mode as Run 1. Run 1’s model looped endlessly on the same paragraphs. Run 3’s model could at least generate unique content and did not get stuck in repetition loops. But the content was still incoherent. The quality sat somewhere between “complete garbage” and “drunk person explaining hacking”, an improvement over Run 1, but not remotely usable.

19.3 What the Metrics Did Not Tell Me

Run 3’s training metrics were fine by every standard measure. Loss decreased smoothly (1.49 → 0.99), gradient norms were stable, eval loss was flat, token accuracy went from 67% to 73%. None of it predicted what the model would actually produce at inference.

Training metrics measure a very specific thing: next-token cross-entropy, on the training sequences for the training loss and on held-out examples from the same distribution for the eval loss. They do not measure whether the model can follow instructions, hold a thought across multiple sentences, or write a syntactically valid shell command. The model got better at predicting the next token in training sequences and worse at everything else.
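A cheap complement to loss curves is a handful of behavioral checks run on real generations. The sketch below is a crude illustration, not a real evaluation harness. The function names, the 80-word budget, and the marker list are all hypothetical; the marker phrases are lifted from the Run 3 outputs quoted in section 19.

```python
import re

# Phrases taken from the Run 3 outputs; a deliberately small, hypothetical list
SELF_CORRECTION = re.compile(r"\b(wait|still not right|that's wrong)\b", re.IGNORECASE)


def count_sentences(text: str) -> int:
    """Rough sentence count: split after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def smoke_check(response: str, max_sentences: int, max_words: int = 80) -> list[str]:
    """Return the names of failed checks; an empty list means the response passed."""
    failures = []
    if count_sentences(response) > max_sentences:
        failures.append("too many sentences")
    if len(response.split()) > max_words:
        failures.append("too long")
    if SELF_CORRECTION.search(response):
        failures.append("self-correction")
    return failures


base = (
    "Nmap (Network Mapper) is a free and open-source utility used for network "
    "discovery and security auditing. It allows administrators and security "
    "professionals to scan networks, identify live hosts, open ports, and detect "
    "service versions."
)
# Tail of the fine-tuned answer quoted in section 19.1
tuned_tail = (
    "using mismatched banner strings from telnet outputs etc... "
    "wait getting sidetracked again!"
)

print(smoke_check(base, max_sentences=2))        # []
print(smoke_check(tuned_tail, max_sentences=2))  # ['self-correction']
```

Sentence counting alone would not have caught the Run 3 answer, because most of it is one enormous run-on sentence. That is why the sketch also applies a word budget and looks for self-correction phrases.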

I had diagnosed Run 1’s failure as hyperparameter damage in section 14.2. That was partly right: the high learning rate caused the loops and memorized fragments. But Run 3 used correct hyperparameters and still broke. Something more fundamental was wrong.

20. The Real Problem: Attention-Only LoRA in MoE Models

20.1 What We Were Training

Here is the layer structure of GLM-4.7-Flash:

Layer 0 (dense):
  self_attn:  q_a_proj, q_b_proj, kv_a_proj_with_mqa, kv_b_proj, o_proj  (nn.Linear)
  mlp:        gate_proj, up_proj, down_proj                                (nn.Linear)

Layers 1-46 (MoE):
  self_attn:  q_a_proj, q_b_proj, kv_a_proj_with_mqa, kv_b_proj, o_proj  (nn.Linear)
  mlp:
    experts:         gate_up_proj [64, 3072, 2048]                         (nn.Parameter, 3D)
                     down_proj    [64, 2048, 1536]                         (nn.Parameter, 3D)
    gate:            weight       [64, 2048]                               (nn.Linear — DO NOT touch)
    shared_experts:  gate_proj, up_proj, down_proj                         (nn.Linear)

Our TARGET_MODULES for all three runs:

["q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj"]

Five attention modules per layer. Zero MLP modules. Zero shared expert modules.

21 million trainable parameters. 0.07% of the model. All in attention.

20.2 Why Attention-Only Breaks MoE Models

In a transformer, the attention layers decide what information to attend to, which tokens in the context matter for predicting the next token. The MLP/expert layers decide what to do with that information. They transform the attended representation into the output.

When you fine-tune only the attention layers, you change what the model looks at without changing what it does with what it sees. The attention mechanism starts routing different information to the MLP layers, but those MLP layers were trained to handle the original attention distribution. They get inputs they were never designed for.

It’s not immediately obvious because the first few tokens look fine. The LoRA deltas are small relative to base weights, so early in a response the attention hasn’t diverged much and the MLP layers can handle it. But by the second or third sentence, the shifted attention patterns have cascaded through 47 layers of residual connections. Each layer feeds the next, the misalignment compounds, and by the time you reach the LM head the representation is far enough off-distribution that the model starts producing tokens that would never normally appear together. That’s where you get the hallucinated URLs, broken syntax, and run-on self-corrections.

20.3 What Every Successful Config Targets

Every published fine-tuning config for MoE models of this class targets both attention and MLP:

Source	Target Modules
Unsloth GLM-4.7-Flash notebook	`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`
DeepSeek-MoE (official finetune.py)	`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`
Qwen3-30B community configs	`all-linear`
Mixtral 8x7B (Axolotl)	`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`

None of them target attention only. None.

I did not realize this because I was focused on the wrong problem. The fused 3D expert tensors (nn.Parameter) cannot have LoRA applied to them. I knew that. I concluded that the only option was attention-only LoRA. I was wrong. There was a third component I had not examined.

20.4 The Shared Experts

Each MoE layer has three types of components. The routed experts (64 per layer, stored as fused 3D nn.Parameter tensors) cannot be LoRA’d, and only 4 of 64 activate per token. The expert gate/router is nn.Linear, but the research consensus is to leave it alone because fine-tuning the router destabilizes expert selection. And then there are the shared experts: one per layer, built from standard nn.Linear layers (gate_proj, up_proj, down_proj) that process every token regardless of routing. They can be LoRA’d. PEFT handles them natively.

I had been staring at the fused 3D expert tensors and concluding “we can’t LoRA the MLP layers.” The shared experts were sitting right next to them in the architecture the whole time. I just didn’t look closely enough.

20.5 The Fix

# Before (Runs 1-3): attention only
TARGET_MODULES = [
    "q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj",
]
# 235 modules, 21M params (0.07%)

# After (Run 4): attention + shared expert FFN
TARGET_MODULES = [
    "q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]
# ~376 modules, ~30-40M params (~0.1-0.13%)

Adding three module names. That is the entire change. Same hyperparameters, same dataset, same training pipeline, same export process.

The gate_proj, up_proj, down_proj names match:

Layer 0’s dense MLP: model.layers.0.mlp.gate_proj (Linear)
Layers 1-46’s shared experts: model.layers.N.mlp.shared_experts.gate_proj (Linear)

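PEFT matches each entry in target_modules against the end of a module’s dotted name (an exact name or a .name suffix), so gate_proj matches model.layers.N.mlp.shared_experts.gate_proj but not the router at model.layers.N.mlp.gate. The fused 3D expert tensors (model.layers.N.mlp.experts.gate_up_proj and model.layers.N.mlp.experts.down_proj) are nn.Parameter on a Glm4MoeLiteNaiveMoe module. They are not child modules, so PEFT does not see them. Only the nn.Linear instances match.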
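The module counts in section 20.5 can be checked without loading the model. The sketch below rebuilds the dotted module names from the layer diagram in section 20.1 and applies an exact-name-or-dotted-suffix match. The is_target helper is my simplified stand-in for how PEFT treats a list of target names, not PEFT’s own code, and the name layout is an assumption based on that diagram.

```python
ATTENTION = ["q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj"]
FFN = ["gate_proj", "up_proj", "down_proj"]
NUM_LAYERS = 47  # layer 0 is dense, layers 1-46 are MoE


def module_names() -> list[str]:
    """Dotted names of the child modules sketched in section 20.1."""
    names = []
    for i in range(NUM_LAYERS):
        prefix = f"model.layers.{i}"
        names += [f"{prefix}.self_attn.{m}" for m in ATTENTION]
        if i == 0:
            names += [f"{prefix}.mlp.{m}" for m in FFN]
        else:
            names.append(f"{prefix}.mlp.gate")     # router, intentionally not targeted
            names.append(f"{prefix}.mlp.experts")  # holds 3D nn.Parameter tensors, no Linear children
            names += [f"{prefix}.mlp.shared_experts.{m}" for m in FFN]
    return names


def is_target(name: str, targets: list[str]) -> bool:
    """Exact name or dotted-suffix match (simplified stand-in for PEFT's list matching)."""
    return any(name == t or name.endswith("." + t) for t in targets)


names = module_names()
attn_only = [n for n in names if is_target(n, ATTENTION)]
attn_ffn = [n for n in names if is_target(n, ATTENTION + FFN)]

print(len(attn_only), len(attn_ffn))  # 235 376
```

The router at mlp.gate stays untouched because its name does not end in .gate_proj. The fused experts never appear as candidates because their gate_up_proj and down_proj are parameters of the experts module rather than modules of their own.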

21. Lessons So Far

Three training runs. Two broken models. One killed mid-training.

Run	Time	Result	What Was Wrong
1	~9h	Infinite loops, hallucinated garbage	LR 10x too high (effective 4e-4)
2	~2h (killed)	Never finished	LR 2x too low (effective 2e-5)
3	~9h	Coherent start → incoherent finish	Attention-only LoRA (no MLP modules)

The hyperparameter corrections between Run 1 and Run 3 were real and necessary, but not sufficient. The fundamental issue was not how aggressively the model was being trained. It was what parts of the model were being trained. Attention and MLP work as a pair. Train one without the other and you get a model that looks at the right things but says the wrong things.

The shared expert FFN layers were standard nn.Linear modules the whole time. No special handling, no additional patches. PEFT supports them out of the box. Every guide I found targets attention + MLP together. I now understand why. It’s not optional.

Run 4 starts now. Same hyperparameters as Run 3. Three additional module names. Another 9 hours to find out if the model can produce a correct find command.