sjenkins 0676ac7ae3
Mount Docker socket into act container
The DinD runner needs access to the host's Docker socket for
docker build and docker push to work.
2026-06-04 21:31:20 -05:00

vLLM on Intel Arc Pro B70 — Kubernetes Deployment

Overview

Run upstream vLLM v0.21.0 on 4× Intel Arc Pro B70 GPUs (Battlemage / Xe2, 32 GB each) to serve Qwen3.6 models, without relying on Intel's older llm-scaler-vllm fork.

Stack

Component         Version                              Notes
vLLM              0.21.0                               Stock upstream, zero source patches
Python            3.12                                 From base image
oneAPI            2026.0                               Bundled in base image
PyTorch XPU       2.11.0+xpu                           Pre-built wheel from PyTorch XPU index (bundles its own libsycl)
vllm-xpu-kernels  0.1.8.2                              Prebuilt wheel from GitHub release
Base image        intel/llm-scaler-platform:26.18.8.2  oneAPI 2026.0 + Ubuntu

Critical env vars

export ONEAPI_DEVICE_SELECTOR="*:gpu"        # NOT level_zero:N (breaks triton FLA probe)
export TRITON_INTEL_DEVICE_ARCH=20.2.0        # B70's EXACT IP (bmg-g21=20.1.0 corrupts output)
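
Because a wrong value here fails silently rather than loudly, it may help to guard the container entrypoint with a preflight check. The following is a minimal sketch, not part of this repo; the function name check_xpu_env is hypothetical, and the expected values are simply the two shown above.

```shell
# Hypothetical preflight guard (not part of this repo): return non-zero
# unless both critical variables hold the values this README requires.
check_xpu_env() {
  local rc=0
  if [ "${ONEAPI_DEVICE_SELECTOR:-}" != "*:gpu" ]; then
    echo "ONEAPI_DEVICE_SELECTOR must be '*:gpu', got '${ONEAPI_DEVICE_SELECTOR:-<unset>}'" >&2
    rc=1
  fi
  if [ "${TRITON_INTEL_DEVICE_ARCH:-}" != "20.2.0" ]; then
    echo "TRITON_INTEL_DEVICE_ARCH must be '20.2.0', got '${TRITON_INTEL_DEVICE_ARCH:-<unset>}'" >&2
    rc=1
  fi
  return "$rc"
}
```

Calling `check_xpu_env || exit 1` before launching vLLM would stop a pod that was started with bmg-g21 or a level_zero:N selector instead of letting it serve corrupted output.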

Do NOT use

  • GGUF / EXL2 / Ollama-only models (no GGUF kernels on XPU)
  • MTP/speculative decode (XPU GDN-kernel gap on v0.21)
  • TRITON_INTEL_DEVICE_ARCH=bmg-g21 (wrong stepping → silent garbage)

Quick start

1. Build the image (on aether-node-1)

cd /home/sjenkins/repos/HomeAsCode/infra/vllm-b70
docker build -t localhost:5000/ai/vllm-b70:0.21.0-oneapi2026 -f Dockerfile .

Build time: roughly 15-20 minutes. The image installs a pre-built PyTorch XPU wheel, so nothing is compiled from source.

2. Push to registry

docker push localhost:5000/ai/vllm-b70:0.21.0-oneapi2026

3. Create HF cache PVC

kubectl apply -f hf-cache-pvc.yaml
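
Before deploying a model it can be useful to confirm the claim was actually bound. A minimal sketch, assuming kubectl 1.23 or newer (for `--for=jsonpath`); the function name wait_for_pvc and the 120s default are illustrative. Passing the manifest with -f avoids guessing the claim's name or namespace.

```shell
# Sketch: block until every PVC defined in the manifest reports phase Bound.
wait_for_pvc() {
  local manifest="${1:-hf-cache-pvc.yaml}" timeout="${2:-120s}"
  kubectl wait --for=jsonpath='{.status.phase}'=Bound \
    -f "$manifest" --timeout="$timeout"
}
```

If the storage class uses WaitForFirstConsumer binding, the claim stays Pending until the first pod mounts it, so a timeout here is expected and harmless in that case.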

4. Deploy a model

# Deploy the official 35B-A3B baseline
kubectl apply -f deployments/35b-a3b-official.yaml

# Or use the deploy script:
./scripts/deploy-vllm-b70.sh Qwen/Qwen3.6-35B-A3B 4

5. Monitor

kubectl -n ai logs -f deployment/vllm-b70-qwen36-35b-a3b
kubectl -n ai get pods -l app=vllm-b70 --watch

Validation order

  1. 35B-A3B official (deployments/35b-a3b-official.yaml) — correctness baseline
  2. 27B official (deployments/27b-official.yaml) — hybrid model baseline
  3. 27B INT4 (deployments/27b-int4.yaml) — practical INT4 on 2 cards
  4. 27B uncensored (deployments/27b-uncensored.yaml) — Huihui abliterated
  5. 35B-A3B uncensored (deployments/35b-a3b-uncensored.yaml) — Huihui abliterated
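
The order above could be scripted roughly as follows. This is a sketch under two assumptions: each manifest contains only a Deployment (if it also defines a Service, `rollout status -f` may need replacing with an explicit `deployment/<name>`), and a 30-minute rollout timeout is enough for model loading. The function name validate_in_order is hypothetical.

```shell
# Sketch: walk the validation order one manifest at a time, tearing each
# deployment down before the next so two XPU workloads never overlap.
validate_in_order() {
  local manifest
  for manifest in \
    deployments/35b-a3b-official.yaml \
    deployments/27b-official.yaml \
    deployments/27b-int4.yaml \
    deployments/27b-uncensored.yaml \
    deployments/35b-a3b-uncensored.yaml
  do
    echo "== $manifest"
    kubectl apply -f "$manifest" || return 1
    # Large models load slowly; the 30m timeout is a guess, tune as needed.
    kubectl rollout status -f "$manifest" --timeout=30m || return 1
    # Run the health check and a test prompt here before moving on.
    kubectl delete -f "$manifest" --wait=true || return 1
  done
}
```

Stopping at the first failure keeps the broken deployment in place for inspection instead of moving on to the next model.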

Health check

kubectl -n ai exec -it deployment/vllm-b70-qwen36-35b-a3b -- \
  curl -s http://localhost:8000/health
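
Model loading can take several minutes, so a single probe often fails at first. The sketch below polls the same endpoint until it returns HTTP 200; the function name wait_healthy and the defaults of 60 tries with a 10-second delay are assumptions to tune.

```shell
# Sketch: poll /health from inside the pod until it answers HTTP 200.
wait_healthy() {
  local deploy="${1:-deployment/vllm-b70-qwen36-35b-a3b}"
  local tries="${2:-60}" delay="${3:-10}" code i=1
  while [ "$i" -le "$tries" ]; do
    # -w prints only the status code; the response body is discarded.
    code=$(kubectl -n ai exec "$deploy" -- \
      curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/health 2>/dev/null)
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $tries attempts (last HTTP code: ${code:-none})" >&2
  return 1
}
```

Run `wait_healthy` with no arguments for the 35B-A3B baseline, or pass another deployment name as the first argument.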

Test API

# Chat completion
kubectl -n ai exec -it deployment/vllm-b70-qwen36-35b-a3b -- \
  curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "Qwen/Qwen3.6-35B-A3B",
      "messages": [{"role": "user", "content": "What is the capital of France?"}],
      "max_tokens": 64
    }'

# List models
kubectl -n ai exec -it deployment/vllm-b70-qwen36-35b-a3b -- \
  curl -s http://localhost:8000/v1/models
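
For scripted smoke tests, the model listing can be turned into a pass/fail check. A minimal sketch, assuming the response follows the OpenAI-style shape with an "id" field per model; model_listed is a hypothetical helper, and plain grep is used only to avoid a jq dependency.

```shell
# Sketch: succeed only if the /v1/models response (read from stdin)
# lists the given model id. The id is used as a regex, so "." matches
# any character; that is loose but adequate for a smoke test.
model_listed() {
  grep -Eq "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage (drops -it, which is unnecessary when piping):
#   kubectl -n ai exec deployment/vllm-b70-qwen36-35b-a3b -- \
#     curl -s http://localhost:8000/v1/models | model_listed "Qwen/Qwen3.6-35B-A3B"
```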

Resource allocation

Model                   TP  GPUs  Memory/card  Notes
Qwen3.6-35B-A3B (BF16)  4   4     ~18 GB       MoE, ~3B active params
Qwen3.6-27B (BF16)      4   4     ~13 GB       Hybrid GDN+attention
Qwen3.6-27B INT4        2   2     ~9 GB        AutoRound compressed-tensors
Huihui 27B (BF16)       4   4     ~13 GB       Uncensored, safetensors
Huihui 35B-A3B (BF16)   4   4     ~18 GB       Uncensored, safetensors
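
As a rough cross-check, weights-only memory per card is total parameters times bytes per parameter, divided by the tensor-parallel degree. The sketch below is an estimate, not a measurement: it ignores KV cache, activations, and runtime overhead, and for MoE models it should be fed the total parameter count, since all experts stay resident even though only a few are active per token. The function name weights_gb_per_card is hypothetical.

```shell
# Sketch: weights-only lower bound in GB per card.
#   $1 = parameters in billions, $2 = bytes per parameter, $3 = TP degree
weights_gb_per_card() {
  awk -v p="$1" -v b="$2" -v tp="$3" 'BEGIN { printf "%.1f\n", p * b / tp }'
}

weights_gb_per_card 27 2 4    # Qwen3.6-27B, BF16 (2 bytes/param), TP=4: prints 13.5
```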

Fallback image

If the image build fails or is too slow, fall back to:

intel/llm-scaler-vllm:0.14.0-b8.3

This is Intel's older fork: easier to get running, but it lags behind current upstream vLLM.

Troubleshooting

torch.xpu sees 0 devices

  • Check /dev/dri is mounted in the container
  • Verify ONEAPI_DEVICE_SELECTOR="*:gpu" is set
  • Run xpu-smi on the host to confirm GPU visibility
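
The in-container checks can be bundled into one helper. This is a sketch that assumes the pod stays up long enough to exec into and that python3 is on the container's PATH; xpu_diag is a hypothetical name. The host-side check remains a manual step (for example, `xpu-smi discovery` on the node).

```shell
# Sketch: inspect device nodes, the selector variable, and torch's view
# of the GPUs from inside a running pod.
xpu_diag() {
  local deploy="${1:-deployment/vllm-b70-qwen36-35b-a3b}"
  echo "--- /dev/dri inside the container"
  kubectl -n ai exec "$deploy" -- ls -l /dev/dri
  echo "--- ONEAPI_DEVICE_SELECTOR inside the container"
  kubectl -n ai exec "$deploy" -- printenv ONEAPI_DEVICE_SELECTOR
  echo "--- devices visible to torch"
  kubectl -n ai exec "$deploy" -- \
    python3 -c 'import torch; print(torch.xpu.device_count())'
}
```

An empty /dev/dri listing points at the pod spec or device plugin, while populated device nodes with a torch count of 0 points at the selector or the driver stack.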

"Backends mismatch" on libsycl

  • Ensure torch was built against system oneAPI 2026.0 (libsycl.so.9)
  • The prebuilt torch wheel bundles 2025.3 (libsycl.so.8) — incompatible with triton-xpu

Silent garbage output after ~35 requests

  • TRITON_INTEL_DEVICE_ARCH is set to bmg-g21 (IP 20.1.0) instead of 20.2.0
  • The B70 silicon is IP 20.2.0; wrong stepping causes silent mis-computation under torch.compile

Two XPU processes running simultaneously

  • Concurrent torch/vLLM processes race the shared triton/NEO compile cache
  • Kill all other vLLM pods before starting a new one
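
A small guard can enforce this before each deploy. The sketch below reuses the app=vllm-b70 label from the Monitor step and assumes every vLLM pod in the ai namespace carries it; no_other_vllm_pods is a hypothetical name.

```shell
# Sketch: return 0 only when no pod with the app=vllm-b70 label exists.
# Returns 2 if kubectl itself fails, 1 if pods are still present.
no_other_vllm_pods() {
  local pods
  pods=$(kubectl -n ai get pods -l app=vllm-b70 -o name) || return 2
  if [ -n "$pods" ]; then
    echo "existing vLLM pods, remove them first:" >&2
    echo "$pods" >&2
    return 1
  fi
  return 0
}
```

For example, `no_other_vllm_pods && kubectl apply -f deployments/27b-official.yaml` only deploys once the previous model's pods are gone.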

References