Axolotl on Ubuntu 26.04 for Ryzen AI Max+ 395: ROCm 7.2, bitsandbytes and gfx1151 fixes
Step-by-step fixes for Axolotl LoRA fine-tuning on a Geekom A9 with AMD Ryzen AI Max+ 395 (gfx1151) on Ubuntu 26.04: ROCm 7.2 PyTorch nightlies, building bitsandbytes from source, and the Axolotl configuration that works.
Why this matters
One of the practical questions behind RiskNodes is simple: can organisations build useful AI capability without sending code, data, or assessment records into somebody else’s cloud?
For regulated teams, consultancies, research groups, and internal engineering functions, that question is not theoretical. If you want a sovereign AI estate, you need local inference and, in some cases, local fine-tuning. The AMD Ryzen AI Max+ 395 is an interesting platform for that work because it combines a modern integrated GPU with a large unified-memory pool. Properly configured, it can hold model and optimiser state that would normally push you towards a separate workstation GPU.
The difficulty is that the software stack is still immature. Ubuntu 26.04, ROCm, PyTorch nightly builds, bitsandbytes, and Axolotl do work together on this machine, but not by default. This post records the configuration that worked for us on a Geekom A9 with Radeon 8060S graphics (gfx1151). It assumes the hardware is already running a stable Ubuntu 26.04 installation; the previous post in this series covers the firmware preparation, NVMe fixes, and OS installation.
Tested environment
The working setup was:
- Host: AMD Ryzen AI Max+ 395 with Radeon 8060S (`gfx1151`) and 128 GB unified memory, with 64 GB allocated to VRAM
- OS: Ubuntu 26.04, kernel `7.0.0-14-generic`
- PyTorch: `torch 2.13.0.dev20260420+rocm7.2`
- System ROCm SDK: ROCm 7.1 from the Ubuntu repositories
- Training target: Axolotl 0.16.1 running LoRA fine-tuning against an 8B Mistral or Ministral-class model
The important detail is that the standard Axolotl installation path assumes NVIDIA and tends to pull in CUDA-oriented wheels. On this hardware, the safer route is to install the ROCm PyTorch wheel first, then layer Axolotl on top.
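As a sketch of that order of operations (the nightly index URL and the exact pins are assumptions matching the versions above, not canonical instructions):

# Inside a fresh virtualenv: install the ROCm nightly wheel of PyTorch first,
# so that Axolotl's dependency resolution does not replace it with a CUDA build.
# The rocm7.2 index path is an assumption based on the torch build we tested.
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm7.2

# Then layer Axolotl on top; if pip tries to swap torch for a CUDA wheel,
# reinstall the ROCm wheel afterwards and verify torch.version.hip is still set.
pip install axolotl==0.16.1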
What failed first
Three issues consumed most of the time.
1. Axolotl failed with No target_modules passed
The first smoke run failed with:
ValueError: No `target_modules` passed but also no `target_parameters` found.
The cause was not the model. It was the configuration key. Axolotl no longer infers LoRA target modules automatically when used with current peft releases. A stray `target_modules:` key was ignored silently. The key that Axolotl actually forwards is `lora_target_modules:`.
2. bitsandbytes could not find a ROCm 7.2 binary
The next failure looked like this:
RuntimeError: Configured ROCm binary not found at .../libbitsandbytes_rocm72.so
At the time of writing, the published bitsandbytes 0.49.1 wheel includes ROCm binaries for older targets, but no pre-built binary matching our combination: a PyTorch build that reports HIP 7.2 and `gfx1151` hardware. That matters if you want 4-bit or 8-bit loading, or the `adamw_8bit` optimiser.
3. The logs contained several distracting but harmless errors
Several warnings appeared repeatedly and turned out not to be the root cause:
- `(null): No such file or directory` from capability probes looking for `nvidia-smi` or `ldconfig`
- `torchao` failures to load Hopper-specific kernels that are irrelevant on AMD
- Axolotl reporting `compute_capability: sm_115`, which is simply a poor label for `gfx1151`
- A save-time warning about `mistral` and `ministral` model types, even though the LoRA adapter weights still saved correctly
These messages are untidy, but they are not, in themselves, a blocker.
The Axolotl configuration that worked
The first fix was to make the LoRA configuration explicit and disable the fused LoRA kernels that currently route into NVIDIA-oriented Triton and torchao paths.
adapter: lora
lora_r: 8
lora_alpha: 16
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
# Disable fused LoRA kernels on AMD
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false
lora_embedding_kernel: false
optimizer: adamw_torch
load_in_8bit: false
load_in_4bit: false
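A run with this configuration can be launched through Axolotl's standard module entry point (the config filename here is illustrative):

# Run LoRA fine-tuning with the configuration above
python -m axolotl.cli.train lora-smoke.yml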
With that configuration, training completed end to end on gfx1151. On a small smoke dataset we observed roughly 17 GiB of VRAM reserved and around 11 tokens per second on the second optimiser step for an 8B model with rank-8 LoRA at sequence_len: 1024.
That is not the final performance envelope of the platform. It is, however, enough to establish that the basic PyTorch and Axolotl stack is viable on this class of local hardware.
Building bitsandbytes for ROCm 7.2 and gfx1151
bitsandbytes is not required for a basic LoRA run on a machine with 64 GB of GPU-addressable memory. It does matter if you want to reduce optimiser-state pressure or run larger models with quantised base weights.
Install the ROCm development packages
On Ubuntu 26.04, ROCm packages are laid out under /usr rather than /opt/rocm. We installed:
sudo apt install rocm rocm-dev hipcc rocm-cmake rocm-device-libs-21 libhipblaslt-dev
`rocm-dev` pulls in most of the required HIP and ROCm development stack. `libhipblaslt-dev` still needs to be installed separately.
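Before building, it is worth a quick sanity check that the compiler wrapper and the device-library bitcode are where the next step expects them (paths as on this Ubuntu 26.04 layout):

# Confirm the HIP compiler wrapper is installed and on PATH
hipcc --version

# Confirm the AMDGCN device-library bitcode exists at the path used below
ls /usr/lib/llvm-21/lib/clang/21/amdgcn/bitcode | head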
Build the native library from source
CMake 4.2 was unwilling to use the hipcc wrapper directly, but it did detect Clang 21 correctly. We also had to point Clang at the AMDGCN device-library bitcode.
git clone https://github.com/bitsandbytes-foundation/bitsandbytes ~/bbytes/bitsandbytes
cd ~/bbytes/bitsandbytes
git checkout 0.49.1
rm -rf build
cmake -DCOMPUTE_BACKEND=hip \
-DBNB_ROCM_ARCH="gfx1151" \
-DCMAKE_HIP_FLAGS="--rocm-path=/usr --rocm-device-lib-path=/usr/lib/llvm-21/lib/clang/21/amdgcn/bitcode" \
-S . -B build
cmake --build build -j$(nproc)
That produced:
bitsandbytes/libbitsandbytes_rocm71.so
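A crude but effective check that the gfx1151 code object actually made it into the library (the target name is embedded in the HIP fat binary):

# Should print "gfx1151"; silence means the architecture flag was not picked up
strings bitsandbytes/libbitsandbytes_rocm71.so | grep -m1 -o gfx1151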
Deploy the library without reinstalling the package
Running `pip install -e .` was counterproductive in our case. It triggered a fresh build in a temporary directory, defaulted back towards the CPU backend, and then failed in `cpu_ops.cpp` under GCC 15.
The more reliable route was to copy the already-built shared object into the installed package directory and add the symlink that the Python loader expected:
DST=~/fine-tune/.venv/lib/python3.12/site-packages/bitsandbytes
cp bitsandbytes/libbitsandbytes_rocm71.so "$DST/"
ln -sf libbitsandbytes_rocm71.so "$DST/libbitsandbytes_rocm72.so"
PyTorch reported HIP 7.2, so bitsandbytes looked for a rocm72 filename even though the system SDK package version was 7.1.
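You can check which HIP version PyTorch reports, and therefore which filename bitsandbytes will go looking for:

# torch.version.hip drives the rocmXY suffix in the expected library name;
# it printed a 7.2.x string on this machine, hence the rocm72 filename
python -c "import torch; print(torch.version.hip)"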
Smoke test the result
python -c "
import torch, bitsandbytes as bnb
print('bnb', bnb.__version__)
x = torch.randn(64, 64, device='cuda', dtype=torch.float32)
q, s = bnb.functional.quantize_blockwise(x)
print('quantize ok:', q.shape, q.dtype)
"
Expected output:
bnb 0.49.1
quantize ok: torch.Size([64, 64]) torch.uint8
Keep the Python package and native library on the same version
This point is easy to overlook and causes a runtime failure rather than a build failure, which makes it harder to diagnose. Our first attempt used the `main` branch of bitsandbytes, which built a native library missing the `get_cusparse` symbol expected by the installed 0.49.1 Python package. The build itself completed without error, but the package failed at import time.
The fix is to build the exact same version as the installed wheel: `git checkout 0.49.1` before building. If the wheel is upgraded via pip later, the native library and symlink will need to be replaced as well, since the new wheel will overwrite the package directory.
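A quick way to confirm the two stay in step before any rebuild:

# Version of the installed Python package
python -c "import bitsandbytes; print(bitsandbytes.__version__)"

# Tag checked out in the source tree the native library was built from
git -C ~/bbytes/bitsandbytes describe --tags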
What still needs work
This configuration is enough for practical LoRA training on 8B models, and the path to larger models is open. Increasing the BIOS UMA allocation from 64 GB to 96 GB is enough to hold the base weights of a 27B model in bf16 (roughly 54 GB) with comfortable headroom for activations and optimiser state, no quantisation required. That makes 27B-class models such as Gemma 4 27B the natural next target on this platform.
The remaining loose ends are:
- `flash-attn` was not attempted
- Longer sequence lengths will need proper testing beyond a smoke dataset
- `adamw_8bit` remains the next obvious optimisation once bitsandbytes is wired in cleanly
- The manual bitsandbytes build should probably become a repeatable install script (see the sketch after this list)
- Upstream support for ROCm 7.2 and `gfx1151` would remove most of the awkwardness here
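As a starting point for that install script, here is a sketch consolidating the build and deploy steps above; the source and virtualenv paths are the ones used in this post and will need adjusting for other machines:

#!/usr/bin/env bash
# Sketch: rebuild and deploy bitsandbytes for ROCm on gfx1151.
# Assumes the ROCm development packages from earlier in this post are installed.
set -euo pipefail

BNB_VERSION=0.49.1                      # must match the installed wheel
SRC=~/bbytes/bitsandbytes
DST=~/fine-tune/.venv/lib/python3.12/site-packages/bitsandbytes

[ -d "$SRC" ] || git clone https://github.com/bitsandbytes-foundation/bitsandbytes "$SRC"
git -C "$SRC" fetch --tags
git -C "$SRC" checkout "$BNB_VERSION"

rm -rf "$SRC/build"
cmake -DCOMPUTE_BACKEND=hip \
      -DBNB_ROCM_ARCH="gfx1151" \
      -DCMAKE_HIP_FLAGS="--rocm-path=/usr --rocm-device-lib-path=/usr/lib/llvm-21/lib/clang/21/amdgcn/bitcode" \
      -S "$SRC" -B "$SRC/build"
cmake --build "$SRC/build" -j"$(nproc)"

# Deploy: copy the built library into the installed package and recreate
# the rocm72 symlink that matches the HIP version PyTorch reports
cp "$SRC/bitsandbytes/libbitsandbytes_rocm71.so" "$DST/"
ln -sf libbitsandbytes_rocm71.so "$DST/libbitsandbytes_rocm72.so"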
Why this is relevant beyond hobbyist hardware
Although this post is technical, the business relevance is straightforward. Organisations in banking, defence, government, healthcare, engineering, and specialist consulting increasingly want the productivity benefits of modern model tooling without accepting the governance and data-transfer assumptions of public cloud services.
That is the same problem we address in RiskNodes at the application layer. If you want sovereign-first AI operations, the surrounding infrastructure also has to be workable on local hardware. A small, reliable path for local fine-tuning on Ryzen AI Max systems is therefore useful in its own right. It also helps make the larger operating model credible: models can be evaluated, adapted, and governed inside the client’s perimeter rather than exported to somebody else’s platform.
In short, getting Axolotl running on this hardware is not the product. But it is part of the practical foundation for the sort of deployable, audit-friendly AI estate that the product assumes.
Conclusion
A compact desktop unit with 64 GB of GPU-addressable memory, a mainstream Linux distribution, and open-source tooling can now run LoRA fine-tuning against an 8B model. That combination would have required a dedicated server room three years ago. The remaining friction — a hand-built shared library, an explicit configuration key — is real, but it is narrowing quickly. ROCm support is broadening with each release cycle, upstream projects are adding pre-built binaries for newer AMD targets, and hardware like the Ryzen AI Max+ 395 is bringing unified-memory architectures into a price bracket accessible to small teams and individual practitioners.
If you are assembling a sovereign AI estate today, the practical conclusion is encouraging: local inference and fine-tuning on current hardware are no longer experimental. The stack works. The workarounds documented here are, in all likelihood, temporary.