Qwen3-30B-A3B (MoE)
Mixture-of-Experts model — 30B total parameters, 3B active (128 experts), BF16. Validated on Ascend 910B3 at TP=2 and TP=4 (with expert parallel) using both vLLM-Ascend and MindIE.
TOC
Model identityValidated hardware × stackDeployBenchmark resultsFull open-loop data (all 22 columns)Model identity
Validated hardware × stack
This model has also been validated as W8A8 with multi-node Prefill/Decode (PD) disaggregation on 910B4 (cross-node KV transfer) — a separate, more complex topology. The recipes on this page are the single-node aggregated deployments.
Deploy
Self-contained ServingRuntime + InferenceService YAMLs:
MindIE requires root and a writable model volume. Keep
serving.kserve.io/readonly: "false" on the InferenceService and the root
securityContext in the asset. See the Modelcar permission modes in
Extend Inference Runtimes.
Benchmark results
Open-loop per-replica, replica=1. Saturation capacity (RPS/replica) by workload
(< 1 = the workload does not sustain one request/second/replica on this hardware/stack).
Rate-1 latency snapshot, Chat workload (TP=2):
Full open-loop data (all 22 columns)
Every rate level (1–9) × all four workloads × both engines × TP=2/TP=4, with TTFT / E2E / ITL / TPS at p90 / p95 / p99 / mean. Expand each section:
Full 22-column open-loop sweep — 910B3 × 2 · Qwen3-30B-A3B (MoE) · vllm-ascend v0.18.0-openeuler (EP+eager)
Chat 512/256 (capacity < 1 RPS/replica) — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
Code 1024/1024 (capacity < 1 RPS/replica) — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
RAG 4096/512 (capacity < 1 RPS/replica) — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
Long RAG 10240/1536 (capacity < 1 RPS/replica) — capacity 0: all 9 rate levels errored (requests queue past the TTFT timeout). No sustained throughput.
Full 22-column open-loop sweep — 910B3 × 2 · Qwen3-30B-A3B (MoE) · MindIE 2.2.RC1
Chat 512/256 — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
Code 1024/1024 (capacity < 1 RPS/replica) — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
RAG 4096/512 — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
Long RAG 10240/1536 (capacity < 1 RPS/replica) — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
Full 22-column open-loop sweep — 910B3 × 4 · Qwen3-30B-A3B (MoE) · vllm-ascend v0.18.0-openeuler (EP+eager)
Chat 512/256 (capacity < 1 RPS/replica) — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
Code 1024/1024 (capacity < 1 RPS/replica) — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
RAG 4096/512 (capacity < 1 RPS/replica) — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
Long RAG 10240/1536 (capacity < 1 RPS/replica) — capacity 0: all 9 rate levels errored (requests queue past the TTFT timeout). No sustained throughput.
Full 22-column open-loop sweep — 910B3 × 4 · Qwen3-30B-A3B (MoE) · MindIE 2.2.RC1
Chat 512/256 — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
Code 1024/1024 — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
RAG 4096/512 — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).
Long RAG 10240/1536 (capacity < 1 RPS/replica) — units: TTFT / E2E / ITL in ms, TPS in tok/s. * = past saturation (achieved RPS < target rate).