Axolotl on Ubuntu 26.04 for Ryzen AI Max+ 395: ROCm 7.2, bitsandbytes and gfx1151 fixes

Step-by-step fixes for Axolotl LoRA fine-tuning on a Geekom A9 with AMD Ryzen AI Max+ 395 (gfx1151) on Ubuntu 26.04: ROCm 7.2 PyTorch nightlies, building bitsandbytes from source, and the Axolotl configuration that works.

Why this matters

One of the practical questions behind RiskNodes is simple: can organisations build useful AI capability without sending code, data, or assessment records into somebody else’s cloud?

For regulated teams, consultancies, research groups, and internal engineering functions, that question is not theoretical. If you want a sovereign AI estate, you need local inference and, in some cases, local fine-tuning. The AMD Ryzen AI Max+ 395 is an interesting platform for that work because it combines a modern integrated GPU with a large unified-memory pool. Properly configured, it can hold model and optimiser state that would normally push you towards a separate workstation GPU.

The difficulty is that the software stack is still immature. Ubuntu 26.04, ROCm, PyTorch nightly builds, bitsandbytes, and Axolotl do work together on this machine, but not by default. This post records the configuration that worked for us on a Geekom A9 with Radeon 8060S graphics (gfx1151). It assumes the hardware is already running a stable Ubuntu 26.04 installation; the previous post in this series covers the firmware preparation, NVMe fixes, and OS installation.

Tested environment

The working setup was:

  • Host: AMD Ryzen AI Max+ 395 with Radeon 8060S (gfx1151) and 128 GB unified memory, with 64 GB allocated to VRAM
  • OS: Ubuntu 26.04, kernel 7.0.0-14-generic
  • PyTorch: torch 2.13.0.dev20260420+rocm7.2
  • System ROCm SDK: ROCm 7.1 from the Ubuntu repositories
  • Training target: Axolotl 0.16.1 running LoRA fine-tuning against an 8B Mistral or Ministral-class model

The important detail is that the standard Axolotl installation path assumes NVIDIA and tends to pull in CUDA-oriented wheels. On this hardware, the safer route is to install the ROCm PyTorch wheel first, then layer Axolotl on top.

What failed first

Three issues consumed most of the time.

1. Axolotl failed with No target_modules passed

The first smoke run failed with:

ValueError: No `target_modules` passed but also no `target_parameters` found.

The cause was not the model. It was the configuration key. Axolotl no longer infers LoRA target modules automatically when used with current peft releases. A stray target_modules: key was ignored silently. The key that Axolotl actually forwards is lora_target_modules:.
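In practice the fix is a one-key rename in the training YAML (only two module names are shown here for brevity; the full working list appears in the configuration section further down):

```yaml
# Ignored silently by current Axolotl + peft:
# target_modules:
#   - q_proj
#   - v_proj

# The key Axolotl actually forwards to peft:
lora_target_modules:
  - q_proj
  - v_proj
```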

2. bitsandbytes could not find a ROCm 7.2 binary

The next failure looked like this:

RuntimeError: Configured ROCm binary not found at .../libbitsandbytes_rocm72.so

At the time of writing, the published bitsandbytes 0.49.1 wheel includes ROCm binaries for older targets, but no pre-built binary matching our combination: a PyTorch build reporting ROCm 7.2 on gfx1151 hardware. That matters if you want 4-bit or 8-bit loading, or the adamw_8bit optimiser.

3. The logs contained several distracting but harmless errors

Several warnings appeared repeatedly and turned out not to be the root cause:

  • (null): No such file or directory from capability probes looking for nvidia-smi or ldconfig
  • torchao failures to load Hopper-specific kernels that are irrelevant on AMD
  • Axolotl reporting compute_capability: sm_115, which is simply a poor label for gfx1151
  • A save-time warning about mistral and ministral model types, even though the LoRA adapter weights still saved correctly

These messages are untidy, but they are not, in themselves, a blocker.

The Axolotl configuration that worked

The first fix was to make the LoRA configuration explicit and disable the fused LoRA kernels that currently route into NVIDIA-oriented Triton and torchao paths.

adapter: lora
lora_r: 8
lora_alpha: 16
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Disable fused LoRA kernels on AMD
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false
lora_embedding_kernel: false

optimizer: adamw_torch
load_in_8bit: false
load_in_4bit: false

With that configuration, training completed end to end on gfx1151. On a small smoke dataset we observed roughly 17 GiB of VRAM reserved and around 11 tokens per second on the second optimiser step for an 8B model with rank-8 LoRA at sequence_len: 1024.

That is not the final performance envelope of the platform. It is, however, enough to establish that the basic PyTorch and Axolotl stack is viable on this class of local hardware.

Building bitsandbytes for ROCm 7.2 and gfx1151

bitsandbytes is not required for a basic LoRA run on a machine with 64 GB of GPU-addressable memory. It does matter if you want to reduce optimiser-state pressure or run larger models with quantised base weights.
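The memory argument can be made concrete with back-of-envelope arithmetic. The 8B and 20M parameter counts below are illustrative assumptions, not measurements from this machine:

```python
def weight_gib(n_params: int, bytes_per_param: float) -> float:
    # Convert a parameter count at a given precision to GiB (2**30 bytes)
    return n_params * bytes_per_param / 2**30

n_base = 8_000_000_000   # assumed 8B-class base model

# Base weights: bf16 (2 bytes/param) vs 4-bit quantised (0.5 bytes/param,
# ignoring quantisation-constant overhead)
print(f"bf16 base weights:  {weight_gib(n_base, 2.0):.1f} GiB")   # ~14.9 GiB
print(f"4-bit base weights: {weight_gib(n_base, 0.5):.1f} GiB")   # ~3.7 GiB

# AdamW keeps two moment tensors per *trainable* parameter, so for a
# hypothetical 20M-parameter LoRA adapter the optimiser state is small
# either way -- the big win from bitsandbytes is the quantised base weights.
n_lora = 20_000_000
print(f"fp32 optimiser state:  {weight_gib(n_lora, 2 * 4.0):.2f} GiB")
print(f"8-bit optimiser state: {weight_gib(n_lora, 2 * 1.0):.2f} GiB")
```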

Install the ROCm development packages

On Ubuntu 26.04, ROCm packages are laid out under /usr rather than /opt/rocm. We installed:

sudo apt install rocm rocm-dev hipcc rocm-cmake rocm-device-libs-21 libhipblaslt-dev

rocm-dev pulls in most of the required HIP and ROCm development stack. libhipblaslt-dev still needs to be installed separately.

Build the native library from source

CMake 4.2 was unwilling to use the hipcc wrapper directly, but it did detect Clang 21 correctly. We also had to point Clang at the AMDGCN device-library bitcode.

git clone https://github.com/bitsandbytes-foundation/bitsandbytes ~/bbytes/bitsandbytes
cd ~/bbytes/bitsandbytes
git checkout 0.49.1
rm -rf build

cmake -DCOMPUTE_BACKEND=hip \
      -DBNB_ROCM_ARCH="gfx1151" \
      -DCMAKE_HIP_FLAGS="--rocm-path=/usr --rocm-device-lib-path=/usr/lib/llvm-21/lib/clang/21/amdgcn/bitcode" \
      -S . -B build
cmake --build build -j$(nproc)

That produced:

bitsandbytes/libbitsandbytes_rocm71.so

Deploy the library without reinstalling the package

Running pip install -e . was counterproductive in our case. It triggered a fresh build in a temporary directory, defaulted back towards the CPU backend, and then failed in cpu_ops.cpp under GCC 15.

The more reliable route was to copy the already-built shared object into the installed package directory and add the symlink that the Python loader expected:

DST=~/fine-tune/.venv/lib/python3.12/site-packages/bitsandbytes
cp bitsandbytes/libbitsandbytes_rocm71.so "$DST/"
ln -sf libbitsandbytes_rocm71.so "$DST/libbitsandbytes_rocm72.so"

PyTorch reported HIP 7.2, so bitsandbytes looked for a rocm72 filename even though the system SDK package version was 7.1.
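The version string driving that lookup comes from torch.version.hip. A minimal sketch of the mapping as we observed it, assuming the pattern is "rocm" plus the major and minor digits with the patch level dropped (the example version strings are hypothetical):

```python
def bnb_rocm_library_name(hip_version: str) -> str:
    """Derive the bitsandbytes binary filename from a HIP version string.

    Assumed naming pattern: 'rocm' followed by the major and minor
    digits, with the patch level dropped.
    """
    major, minor = hip_version.split(".")[:2]
    return f"libbitsandbytes_rocm{major}{minor}.so"

# Hypothetical torch.version.hip strings:
print(bnb_rocm_library_name("7.2.41134"))   # libbitsandbytes_rocm72.so
print(bnb_rocm_library_name("7.1.0"))       # libbitsandbytes_rocm71.so
```

This is why the symlink above bridges the gap: the build against the system ROCm 7.1 SDK yields the rocm71 name, while the PyTorch wheel's HIP 7.2 runtime makes the loader ask for rocm72.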

Smoke test the result

python -c "
import torch, bitsandbytes as bnb
print('bnb', bnb.__version__)
x = torch.randn(64, 64, device='cuda', dtype=torch.float32)
q, s = bnb.functional.quantize_blockwise(x)
print('quantize ok:', q.shape, q.dtype)
"

Expected output:

bnb 0.49.1
quantize ok: torch.Size([64, 64]) torch.uint8

Keep the Python package and native library on the same version

This point is easy to overlook and causes a runtime failure rather than a build failure, which makes it harder to diagnose. Our first attempt used the main branch of bitsandbytes, which built a native library missing the get_cusparse symbol expected by the installed 0.49.1 Python package. The build itself completed without error, but the package failed at import time.

The fix is to build the exact same version as the installed wheel — git checkout 0.49.1 before building. If the wheel is upgraded via pip later, the native library and symlink will need to be replaced as well, since the new wheel will overwrite the package directory.

What still needs work

This configuration is enough for practical LoRA training on 8B models, and the path to larger models is open. Increasing the BIOS UMA allocation from 64 GB to 96 GB leaves the base weights of a 26B model in bf16 (~52 GB) with comfortable headroom for activations and optimiser state — no quantisation required. That makes 26B-class models such as Gemma 4 27B the natural next target on this platform.
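The headroom claim is simple arithmetic (decimal GB, matching the ~52 GB figure in the text; the 26B parameter count is nominal):

```python
def bf16_weight_gb(n_params: float) -> float:
    # bf16 stores each parameter in 2 bytes; GB here is decimal (1e9 bytes)
    return n_params * 2 / 1e9

uma_gb = 96                       # BIOS UMA allocation
weights = bf16_weight_gb(26e9)    # nominal 26B parameter count
print(f"26B base weights in bf16: {weights:.0f} GB")              # 52 GB
print(f"headroom at {uma_gb} GB UMA: {uma_gb - weights:.0f} GB")  # 44 GB
```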

The remaining loose ends are:

  • flash-attn was not attempted
  • Longer sequence lengths will need proper testing beyond a smoke dataset
  • adamw_8bit remains the next obvious optimisation once bitsandbytes is wired in cleanly
  • The manual bitsandbytes build should probably become a repeatable install script
  • Upstream support for ROCm 7.2 and gfx1151 would remove most of the awkwardness here

Why this is relevant beyond hobbyist hardware

Although this post is technical, the business relevance is straightforward. Organisations in banking, defence, government, healthcare, engineering, and specialist consulting increasingly want the productivity benefits of modern model tooling without accepting the governance and data-transfer assumptions of public cloud services.

That is the same problem we address in RiskNodes at the application layer. If you want sovereign-first AI operations, the surrounding infrastructure also has to be workable on local hardware. A small, reliable path for local fine-tuning on Ryzen AI Max systems is therefore useful in its own right. It also helps make the larger operating model credible: models can be evaluated, adapted, and governed inside the client’s perimeter rather than exported to somebody else’s platform.

In short, getting Axolotl running on this hardware is not the product. But it is part of the practical foundation for the sort of deployable, audit-friendly AI estate that the product assumes.

Conclusion

A compact desktop unit with 64 GB of GPU-addressable memory, a mainstream Linux distribution, and open-source tooling can now run LoRA fine-tuning against an 8B model. That combination would have required a dedicated server room three years ago. The remaining friction — a hand-built shared library, an explicit configuration key — is real, but it is narrowing quickly. ROCm support is broadening with each release cycle, upstream projects are adding pre-built binaries for newer AMD targets, and hardware like the Ryzen AI Max+ 395 is bringing unified-memory architectures into a price bracket accessible to small teams and individual practitioners.

If you are assembling a sovereign AI estate today, the practical conclusion is encouraging: local inference and fine-tuning on current hardware are no longer experimental. The stack works. The workarounds documented here are, in all likelihood, temporary.
