Shell 67.3%
Dockerfile 23.5%
Go Template 9.2%

sjenkins 86fbc6d985	chore: update entrypoint, add vllm-intel-b70 and vllm-intel-b70-fj charts	2026-06-07 22:18:12 -05:00
.forgejo/workflows	fix: add retry logic for large layer push timeouts	2026-06-05 04:02:18 -05:00
chart	chore: update entrypoint, add vllm-intel-b70 and vllm-intel-b70-fj charts	2026-06-07 22:18:12 -05:00
deployments	vLLM B70 stack: Dockerfile, deployments, scripts for Intel Arc Pro B70 with vLLM 0.21.0	2026-06-03 11:48:37 -05:00
scripts	chore: update entrypoint, add vllm-intel-b70 and vllm-intel-b70-fj charts	2026-06-07 22:18:12 -05:00
vllm-intel-b70@0676ac7ae3	chore: update entrypoint, add vllm-intel-b70 and vllm-intel-b70-fj charts	2026-06-07 22:18:12 -05:00
vllm-intel-b70-fj@27d3ec3120	chore: update entrypoint, add vllm-intel-b70 and vllm-intel-b70-fj charts	2026-06-07 22:18:12 -05:00
.actrc	Remove Docker socket mount from .actrc	2026-06-04 21:38:40 -05:00
Dockerfile	vLLM B70 stack: Dockerfile, deployments, scripts for Intel Arc Pro B70 with vLLM 0.21.0	2026-06-03 11:48:37 -05:00
hf-cache-pvc.yaml	vLLM B70 stack: Dockerfile, deployments, scripts for Intel Arc Pro B70 with vLLM 0.21.0	2026-06-03 11:48:37 -05:00
README.md	vLLM B70 stack: Dockerfile, deployments, scripts for Intel Arc Pro B70 with vLLM 0.21.0	2026-06-03 11:48:37 -05:00

README.md

vLLM on Intel Arc Pro B70 — Kubernetes Deployment

Overview

Run upstream vLLM v0.21.0 on 4× Intel Arc Pro B70 (Battlemage / Xe2, 32 GB each) to serve Qwen3.6 models, without relying on Intel's older llm-scaler-vllm fork.

Stack

Component	Version	Notes
vLLM	0.21.0	Stock upstream, zero source patches
Python	3.12	From base image
oneAPI	2026.0	Bundled in base image
PyTorch XPU	2.11.0+xpu	Pre-built wheel from PyTorch XPU index (bundles its own libsycl)
vllm-xpu-kernels	0.1.8.2	Prebuilt wheel from GitHub release
Base image	`intel/llm-scaler-platform:26.18.8.2`	oneAPI 2026.0 + Ubuntu

Critical env vars

export ONEAPI_DEVICE_SELECTOR="*:gpu"        # NOT level_zero:N (breaks triton FLA probe)
export TRITON_INTEL_DEVICE_ARCH=20.2.0        # B70's EXACT IP (bmg-g21=20.1.0 corrupts output)
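
Because a wrong value here fails silently, it may help to check both variables before the server starts. The following is a minimal sketch, not part of the repo's scripts; the function name `check_xpu_env` is hypothetical, and it assumes the two values above are the only acceptable ones for this stack.

```shell
# Hypothetical preflight helper (not from the repo): refuse to continue unless
# both variables hold exactly the values this README requires.
check_xpu_env() {
  rc=0
  if [ "${ONEAPI_DEVICE_SELECTOR:-}" != "*:gpu" ]; then
    echo "ONEAPI_DEVICE_SELECTOR must be '*:gpu' (got '${ONEAPI_DEVICE_SELECTOR:-unset}')" >&2
    rc=1
  fi
  if [ "${TRITON_INTEL_DEVICE_ARCH:-}" != "20.2.0" ]; then
    echo "TRITON_INTEL_DEVICE_ARCH must be '20.2.0' (got '${TRITON_INTEL_DEVICE_ARCH:-unset}')" >&2
    rc=1
  fi
  return "$rc"
}
```

One way to use it is `check_xpu_env || exit 1` near the top of an entrypoint script, so a misconfigured pod fails at startup rather than producing corrupted output later.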

Do NOT use

GGUF / EXL2 / Ollama-only models (no GGUF kernels on XPU)
MTP/speculative decode (XPU GDN-kernel gap on v0.21)
TRITON_INTEL_DEVICE_ARCH=bmg-g21 (wrong stepping → silent garbage)

Quick start

1. Build the image (on aether-node-1)

cd /home/sjenkins/repos/HomeAsCode/infra/vllm-b70
docker build -t localhost:5000/ai/vllm-b70:0.21.0-oneapi2026 -f Dockerfile .

Build time is roughly 15-20 minutes, because the image installs the prebuilt PyTorch XPU wheel instead of compiling PyTorch from source.

2. Push to registry

docker push localhost:5000/ai/vllm-b70:0.21.0-oneapi2026
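
Pushing very large layers can time out (the workflow history above mentions retry logic for exactly this). A minimal sketch of a retry wrapper with exponential backoff follows; the helper name `retry` and its argument order are assumptions for illustration, not the repo's actual workflow code.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
retry() {
  max=$1; delay=$2; shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For example, `retry 5 10 docker push localhost:5000/ai/vllm-b70:0.21.0-oneapi2026` would try up to five times, waiting 10, 20, 40 and 80 seconds between attempts.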

3. Create HF cache PVC

kubectl apply -f hf-cache-pvc.yaml

4. Deploy a model

# Deploy the official 35B-A3B baseline
kubectl apply -f deployments/35b-a3b-official.yaml

# Or use the deploy script:
./scripts/deploy-vllm-b70.sh Qwen/Qwen3.6-35B-A3B 4

5. Monitor

kubectl -n ai logs -f deployment/vllm-b70-qwen36-35b-a3b
kubectl -n ai get pods -l app=vllm-b70 --watch
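
For scripted deployments, blocking until the rollout finishes can replace watching the pod list by hand. This is a sketch only: the function name `wait_for_rollout` and the 30-minute default are assumptions (the first start may need to download model weights), while `kubectl rollout status` and `kubectl get events` are standard kubectl commands.

```shell
# Hypothetical helper: wait_for_rollout DEPLOYMENT [TIMEOUT]
# Blocks until the deployment in namespace "ai" is rolled out, or prints
# the most recent namespace events if it does not finish in time.
wait_for_rollout() {
  deploy=$1
  timeout=${2:-30m}   # assumed default; adjust to your download speed
  if kubectl -n ai rollout status "deployment/$deploy" --timeout="$timeout"; then
    return 0
  fi
  echo "rollout of $deploy did not complete; recent events:" >&2
  kubectl -n ai get events --sort-by=.lastTimestamp | tail -n 20 >&2
  return 1
}
```

Usage: `wait_for_rollout vllm-b70-qwen36-35b-a3b`.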

Validation order

1. 35B-A3B official (deployments/35b-a3b-official.yaml): correctness baseline
2. 27B official (deployments/27b-official.yaml): hybrid model baseline
3. 27B INT4 (deployments/27b-int4.yaml): practical INT4 on 2 cards
4. 27B uncensored (deployments/27b-uncensored.yaml): Huihui abliterated
5. 35B-A3B uncensored (deployments/35b-a3b-uncensored.yaml): Huihui abliterated
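
The order above can be walked through with a small loop. This is a sketch under assumptions: the function name `validate_in_order` is hypothetical, the manual confirmation stands in for whatever checks you run against each model, and each deployment is deleted before the next is applied because only one vLLM process should run at a time (see Troubleshooting).

```shell
# Manifest basenames in the validation order listed in this README.
VALIDATION_ORDER="35b-a3b-official 27b-official 27b-int4 27b-uncensored 35b-a3b-uncensored"

# Hypothetical helper: apply each manifest, wait for a manual go-ahead,
# then remove it before moving on to the next one.
validate_in_order() {
  for name in $VALIDATION_ORDER; do
    echo "== deployments/$name.yaml"
    kubectl apply -f "deployments/$name.yaml" || return 1
    printf 'Checks passed for %s? Press Enter to continue, Ctrl-C to stop: ' "$name" >&2
    read -r reply || return 1
    kubectl delete -f "deployments/$name.yaml" || return 1
  done
}
```

Run `validate_in_order` from the repository root so the relative `deployments/` paths resolve.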

Health check

kubectl -n ai exec -it deployment/vllm-b70-qwen36-35b-a3b -- \
  curl -s http://localhost:8000/health
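
Model loading can take a while, so a single health probe often fails on a fresh pod. A minimal polling sketch follows; the function name `wait_for_health` and the default of 60 attempts at 10-second intervals are assumptions. It drops `-it` because no terminal is needed in a loop, and uses `curl -f` so that HTTP error statuses count as failures.

```shell
# Hypothetical helper: wait_for_health DEPLOYMENT [TRIES] [INTERVAL_SECONDS]
# Polls the /health endpoint inside the pod until it answers successfully.
wait_for_health() {
  deploy=$1
  tries=${2:-60}
  interval=${3:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if kubectl -n ai exec "deployment/$deploy" -- \
         curl -sf -o /dev/null http://localhost:8000/health; then
      echo "healthy after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "not healthy after $tries attempts" >&2
  return 1
}
```

Usage: `wait_for_health vllm-b70-qwen36-35b-a3b`.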

Test API

# Chat completion
kubectl -n ai exec -it deployment/vllm-b70-qwen36-35b-a3b -- \
  curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "Qwen/Qwen3.6-35B-A3B",
      "messages": [{"role": "user", "content": "What is the capital of France?"}],
      "max_tokens": 64
    }'

# List models
kubectl -n ai exec -it deployment/vllm-b70-qwen36-35b-a3b -- \
  curl -s http://localhost:8000/v1/models

Resource allocation

Model	TP	GPUs	Memory/card	Notes
Qwen3.6-35B-A3B (BF16)	4	4	~18 GB	MoE, ~3B active params; all 35B params stay resident (~70 GB of weights in total)
Qwen3.6-27B (BF16)	4	4	~13 GB	Hybrid GDN+attention
Qwen3.6-27B INT4	2	2	~9 GB	AutoRound compressed-tensors
Huihui 27B (BF16)	4	4	~13 GB	Uncensored, safetensors
Huihui 35B-A3B (BF16)	4	4	~18 GB	Uncensored, safetensors
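
As a rough cross-check when sizing a new model, weight memory per card is approximately parameters × bytes per parameter ÷ TP degree. The sketch below computes only that lower bound; it ignores KV cache, activations, runtime overhead and any layers left unquantized, so real usage is higher. The function name `weights_per_card_gb` is hypothetical.

```shell
# Hypothetical helper: weights_per_card_gb PARAMS_BILLIONS BYTES_PER_PARAM TP
# Billions of parameters times bytes per parameter gives decimal GB of weights;
# tensor parallelism splits that evenly across TP cards.
weights_per_card_gb() {
  awk -v p="$1" -v b="$2" -v tp="$3" 'BEGIN { printf "%.1f\n", p * b / tp }'
}

weights_per_card_gb 27 2 4   # Qwen3.6-27B in BF16 (2 bytes/param), TP=4: prints 13.5
```

For INT4 checkpoints, use 0.5 bytes per parameter as a first approximation.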

Fallback image

If the image build fails or is too slow, fall back to Intel's prebuilt image:

intel/llm-scaler-vllm:0.14.0-b8.3

This is Intel's older fork. It is easier to bring up, but it lags behind current upstream vLLM.

Troubleshooting

torch.xpu sees 0 devices

Check /dev/dri is mounted in the container
Verify ONEAPI_DEVICE_SELECTOR="*:gpu" is set
Run xpu-smi on the host to confirm GPU visibility
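
The container-side checks above can be run in one go. This is a sketch with assumptions: the function name `diagnose_xpu` is hypothetical, and it assumes `python3` is on the PATH inside the image. `torch.xpu.device_count()` is the PyTorch call that reports how many XPU devices are visible.

```shell
# Hypothetical helper: diagnose_xpu DEPLOYMENT
# Prints the device selector, the DRI device nodes, and torch's XPU count
# as seen from inside the running container.
diagnose_xpu() {
  deploy=$1
  kubectl -n ai exec "deployment/$deploy" -- sh -c '
    echo "ONEAPI_DEVICE_SELECTOR=${ONEAPI_DEVICE_SELECTOR:-unset}"
    ls -l /dev/dri 2>&1
    python3 -c "import torch; print(\"xpu devices:\", torch.xpu.device_count())"
  '
}
```

Usage: `diagnose_xpu vllm-b70-qwen36-35b-a3b`. An empty or missing `/dev/dri` points at the pod's device mounts rather than at vLLM itself.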

"Backends mismatch" on libsycl

Ensure torch was built against system oneAPI 2026.0 (libsycl.so.9)
The prebuilt torch wheel bundles 2025.3 (libsycl.so.8) — incompatible with triton-xpu
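
To see which SYCL runtime versions are actually present, you can list the `libsycl` sonames under the torch package and under the oneAPI install. A minimal sketch follows; the function name `list_libsycl` is hypothetical, and `/opt/intel/oneapi` is the default oneAPI prefix, assumed here to apply to the base image.

```shell
# Hypothetical helper: list_libsycl DIR
# Prints the distinct libsycl file names found anywhere under DIR.
list_libsycl() {
  find "$1" -name 'libsycl.so*' 2>/dev/null | sed 's|.*/||' | sort -u
}
```

Inside the container, compare `list_libsycl "$(python3 -c 'import torch, os; print(os.path.dirname(torch.__file__))')"` with `list_libsycl /opt/intel/oneapi`. Seeing `libsycl.so.8` on one side and `libsycl.so.9` on the other matches the mismatch described above.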

Silent garbage output after ~35 requests

TRITON_INTEL_DEVICE_ARCH is set to bmg-g21 (IP 20.1.0) instead of 20.2.0
The B70 silicon is IP 20.2.0; wrong stepping causes silent mis-computation under torch.compile

Two XPU processes running simultaneously

Concurrent torch/vLLM processes race the shared triton/NEO compile cache
Kill all other vLLM pods before starting a new one
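
A deploy script can enforce this rule before applying a new manifest. The sketch below assumes every vLLM pod carries the `app=vllm-b70` label used in the Monitor step; the function name `ensure_no_other_vllm_pods` is hypothetical.

```shell
# Hypothetical guard: fail if any vLLM pod is already running in namespace "ai".
ensure_no_other_vllm_pods() {
  running=$(kubectl -n ai get pods -l app=vllm-b70 \
              --field-selector=status.phase=Running -o name) || return 1
  if [ -n "$running" ]; then
    echo "vLLM pods already running; stop them first:" >&2
    echo "$running" >&2
    return 1
  fi
}
```

Usage: `ensure_no_other_vllm_pods && kubectl apply -f deployments/27b-official.yaml`.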

References

crazydart/vllm-b70 — upstream vLLM on B70 repo
vLLM XPU kernels
Intel llm-scaler
