vLLM on Intel Arc Pro B70 — Kubernetes Deployment
Overview
Run upstream vLLM v0.21.0 on 4× Intel Arc Pro B70 (Battlemage / Xe2, 32 GB each)
for Qwen3.6 models — without Intel's older llm-scaler-vllm fork.
Stack
| Component | Version | Notes |
|---|---|---|
| vLLM | 0.21.0 | Stock upstream, zero source patches |
| Python | 3.12 | From base image |
| oneAPI | 2026.0 | Bundled in base image |
| PyTorch XPU | 2.11.0+xpu | Pre-built wheel from PyTorch XPU index (bundles its own libsycl) |
| vllm-xpu-kernels | 0.1.8.2 | Prebuilt wheel from GitHub release |
| Base image | intel/llm-scaler-platform:26.18.8.2 | oneAPI 2026.0 + Ubuntu |
Critical env vars
export ONEAPI_DEVICE_SELECTOR="*:gpu" # NOT level_zero:N (breaks triton FLA probe)
export TRITON_INTEL_DEVICE_ARCH=20.2.0 # B70's EXACT IP (bmg-g21=20.1.0 corrupts output)
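A pre-flight check can catch a wrong value before vLLM starts. The sketch below is illustrative only: `check_xpu_env` is a hypothetical helper name, not something shipped in this repo, and it assumes the two values above are the only acceptable ones.

```shell
# Hypothetical pre-flight check (not part of this repo).
# Returns 0 when both variables hold the values documented above, 1 otherwise.
check_xpu_env() {
  local status=0
  if [ "${ONEAPI_DEVICE_SELECTOR:-}" != "*:gpu" ]; then
    echo "ONEAPI_DEVICE_SELECTOR must be '*:gpu' (got '${ONEAPI_DEVICE_SELECTOR:-unset}')" >&2
    status=1
  fi
  if [ "${TRITON_INTEL_DEVICE_ARCH:-}" != "20.2.0" ]; then
    echo "TRITON_INTEL_DEVICE_ARCH must be '20.2.0' (got '${TRITON_INTEL_DEVICE_ARCH:-unset}')" >&2
    status=1
  fi
  return "$status"
}
```

Calling `check_xpu_env || exit 1` near the top of an entrypoint script would stop the container before any kernels are compiled for the wrong target.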
Do NOT use
- GGUF / EXL2 / Ollama-only models (no GGUF kernels on XPU)
- MTP/speculative decode (XPU GDN-kernel gap on v0.21)
- TRITON_INTEL_DEVICE_ARCH=bmg-g21 (wrong stepping → silent garbage)
Quick start
1. Build the image (on aether-node-1)
cd /home/sjenkins/repos/HomeAsCode/infra/vllm-b70
docker build -t localhost:5000/ai/vllm-b70:0.21.0-oneapi2026 -f Dockerfile .
Build time: ~15-20 minutes (uses pre-built PyTorch XPU wheel, no from-source build).
2. Push to registry
docker push localhost:5000/ai/vllm-b70:0.21.0-oneapi2026
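To confirm the push landed, the registry can be asked for its tag list. This is a sketch that assumes the registry at localhost:5000 serves the standard registry HTTP API v2 over plain HTTP. `has_tag` is a hypothetical helper that does a naive string match rather than real JSON parsing.

```shell
# Hypothetical helper: reads a /v2/<name>/tags/list JSON response on stdin
# and succeeds if the given tag appears in it as a quoted string.
has_tag() {
  local tag=$1
  grep -F -q "\"$tag\""
}

# Example (requires the registry to be reachable):
#   curl -s http://localhost:5000/v2/ai/vllm-b70/tags/list \
#     | has_tag 0.21.0-oneapi2026 && echo "tag present"
```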
3. Create HF cache PVC
kubectl apply -f hf-cache-pvc.yaml
4. Deploy a model
# Deploy the official 35B-A3B baseline
kubectl apply -f deployments/35b-a3b-official.yaml
# Or use the deploy script:
./scripts/deploy-vllm-b70.sh Qwen/Qwen3.6-35B-A3B 4
5. Monitor
kubectl -n ai logs -f deployment/vllm-b70-qwen36-35b-a3b
kubectl -n ai get pods -l app=vllm-b70 --watch
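For scripted use, `kubectl rollout status` blocks until the rollout finishes or a timeout expires. The wrapper below is a sketch: `wait_for_rollout` is a hypothetical name and the 30m default is an arbitrary guess to allow for model download and load. Whether "rolled out" also means "serving" depends on the deployment defining a readiness probe, which is not shown here.

```shell
# Hypothetical wrapper around `kubectl rollout status` for the "ai" namespace
# used in the commands above.
#   $1: deployment name
#   $2: timeout (optional, defaults to an assumed 30m)
wait_for_rollout() {
  local deployment=$1 timeout=${2:-30m}
  kubectl -n ai rollout status "deployment/$deployment" --timeout="$timeout"
}
```

For example, `wait_for_rollout vllm-b70-qwen36-35b-a3b 45m` returns non-zero if the deployment is not rolled out within 45 minutes.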
Validation order
- 35B-A3B official (deployments/35b-a3b-official.yaml): correctness baseline
- 27B official (deployments/27b-official.yaml): hybrid model baseline
- 27B INT4 (deployments/27b-int4.yaml): practical INT4 on 2 cards
- 27B uncensored (deployments/27b-uncensored.yaml): Huihui abliterated
- 35B-A3B uncensored (deployments/35b-a3b-uncensored.yaml): Huihui abliterated
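Because most of these deployments take all four cards, the order above can be walked one manifest at a time. The loop below is a sketch under assumptions: `validate_in_order` and `VALIDATE_HOOK` are hypothetical names, and it assumes each manifest can be removed with `kubectl delete -f` before the next one is applied.

```shell
# Manifests in the validation order listed above (paths contain no spaces).
VALIDATION_MANIFESTS="deployments/35b-a3b-official.yaml
deployments/27b-official.yaml
deployments/27b-int4.yaml
deployments/27b-uncensored.yaml
deployments/35b-a3b-uncensored.yaml"

# Hypothetical driver: apply one manifest, run a caller-supplied check, then
# delete it so the next model does not share the GPUs with a running pod.
# VALIDATE_HOOK is an assumed variable naming a command that receives the
# manifest path; it defaults to `true` (no check). If the hook fails, the
# manifest is left applied for debugging and the loop stops.
validate_in_order() {
  local manifest
  # Unquoted on purpose: split the list on whitespace.
  for manifest in $VALIDATION_MANIFESTS; do
    kubectl apply -f "$manifest" || return 1
    "${VALIDATE_HOOK:-true}" "$manifest" || return 1
    kubectl delete -f "$manifest" || return 1
  done
}
```

Pods may still be terminating when `kubectl delete` returns, so a real run would likely need a wait between iterations.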
Health check
kubectl -n ai exec -it deployment/vllm-b70-qwen36-35b-a3b -- \
curl -s http://localhost:8000/health
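Model loading can take a while, so a single health probe right after deployment may fail. A small retry helper is one way to poll. `wait_until` is a hypothetical name, and the attempt count and delay in the example are arbitrary.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts, and succeed as soon as the command succeeds.
wait_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (requires cluster access); curl -f makes HTTP errors a non-zero exit:
#   wait_until 60 10 kubectl -n ai exec deployment/vllm-b70-qwen36-35b-a3b -- \
#     curl -sf http://localhost:8000/health
```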
Test API
# Chat completion
kubectl -n ai exec -it deployment/vllm-b70-qwen36-35b-a3b -- \
curl -s http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3.6-35B-A3B",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"max_tokens": 64
}'
# List models
kubectl -n ai exec -it deployment/vllm-b70-qwen36-35b-a3b -- \
curl -s http://localhost:8000/v1/models
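When repeating the chat test against the other deployments, the request body can be generated instead of retyped. `chat_payload` below is a hypothetical helper. It does no JSON escaping, so it assumes the model name and prompt contain no double quotes or backslashes.

```shell
# Hypothetical helper: print a chat-completions request body.
#   $1: model name, $2: user prompt, $3: max_tokens (optional, default 64)
# No JSON escaping is performed on the arguments.
chat_payload() {
  local model=$1 prompt=$2 max_tokens=${3:-64}
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}\n' \
    "$model" "$prompt" "$max_tokens"
}

# Example (requires cluster access):
#   kubectl -n ai exec deployment/vllm-b70-qwen36-35b-a3b -- \
#     curl -s http://localhost:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d "$(chat_payload Qwen/Qwen3.6-35B-A3B 'What is the capital of France?')"
```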
Resource allocation
| Model | TP | GPUs | Memory/card | Notes |
|---|---|---|---|---|
| Qwen3.6-35B-A3B (BF16) | 4 | 4 | ~8 GB | MoE, ~3B active params |
| Qwen3.6-27B (BF16) | 4 | 4 | ~13 GB | Hybrid GDN+attention |
| Qwen3.6-27B INT4 | 2 | 2 | ~9 GB | AutoRound compressed-tensors |
| Huihui 27B (BF16) | 4 | 4 | ~13 GB | Uncensored, safetensors |
| Huihui 35B-A3B (BF16) | 4 | 4 | ~8 GB | Uncensored, safetensors |
Fallback image
If the from-source build fails or is too slow:
intel/llm-scaler-vllm:0.14.0-b8.3
This is Intel's older fork — easier but not the most current upstream vLLM.
Troubleshooting
torch.xpu sees 0 devices
- Check /dev/dri is mounted in the container
- Verify ONEAPI_DEVICE_SELECTOR="*:gpu" is set
- Run xpu-smi on the host to confirm GPU visibility
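The /dev/dri check can be scripted. This is a minimal sketch, assuming the driver exposes one `renderD*` node per card, so four would be expected on this host. `count_render_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: count DRM render nodes in a directory.
#   $1: directory to inspect (optional, defaults to /dev/dri)
count_render_nodes() {
  local dir=${1:-/dev/dri} n=0 node
  for node in "$dir"/renderD*; do
    # With no matches the glob stays literal, so check the path exists.
    if [ -e "$node" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

A result of 0 inside the container while the host shows nodes would point at the device mount rather than the driver.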
"Backends mismatch" on libsycl
- Ensure torch was built against system oneAPI 2026.0 (libsycl.so.9)
- The prebuilt torch wheel bundles 2025.3 (libsycl.so.8) — incompatible with triton-xpu
Silent garbage output after ~35 requests
- TRITON_INTEL_DEVICE_ARCH is set to bmg-g21 (IP 20.1.0) instead of 20.2.0
- The B70 silicon is IP 20.2.0; wrong stepping causes silent mis-computation under torch.compile
Two XPU processes running simultaneously
- Concurrent torch/vLLM processes race the shared triton/NEO compile cache
- Kill all other vLLM pods before starting a new one
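A guard along these lines could run before each deploy. It is a sketch: `no_running_vllm_pods` is a hypothetical name, and it assumes every vLLM pod carries the app=vllm-b70 label used in the monitoring command.

```shell
# Hypothetical guard: succeed only if no pod labelled app=vllm-b70 is in the
# Running phase in the "ai" namespace.
# Returns 1 if such pods exist, 2 if kubectl itself fails.
no_running_vllm_pods() {
  local pods
  pods=$(kubectl -n ai get pods -l app=vllm-b70 \
    --field-selector=status.phase=Running -o name) || return 2
  if [ -n "$pods" ]; then
    echo "vLLM pods still running:" >&2
    echo "$pods" >&2
    return 1
  fi
  return 0
}
```

For example, `no_running_vllm_pods && kubectl apply -f deployments/27b-official.yaml` skips the apply while another vLLM pod is still up.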
References
- crazydart/vllm-b70 — upstream vLLM on B70 repo
- vLLM XPU kernels
- Intel llm-scaler