Yeqi Huang
Large Language Models
WaferLLM: Large Language Model Inference at Wafer Scale
WaferLLM introduces the first wafer-scale Large Language Model inference system, achieving up to 200× higher accelerator utilization and 10–20× speedups over GPU clusters. The system builds on a novel PLMR device model and introduces MeshGEMM/MeshGEMV operations optimized for wafer-scale architectures with hundreds of thousands of AI cores (a simplified sketch of the mesh-distributed GEMV idea follows this entry).
Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai
OSDI 2025 · Code · GitHub
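The mesh-distributed GEMV pattern can be illustrated with a toy simulation: tile the weight matrix across a 2-D grid of cores, give each core the matching shard of the input vector, and reduce the partial products along each mesh row. The sketch below is a minimal NumPy illustration of this general idea under those assumptions, not WaferLLM's actual MeshGEMV implementation; the function name `mesh_gemv` and the tiling scheme are illustrative.

```python
import numpy as np

def mesh_gemv(W: np.ndarray, x: np.ndarray, mesh: tuple[int, int]) -> np.ndarray:
    """Simulate a mesh-distributed GEMV, y = W @ x.

    W is tiled across an (R, C) grid of cores; each core holds one tile
    and the matching shard of x, computes a local partial product, and
    the partials are summed along each mesh row (the reduce phase).
    """
    R, C = mesh
    M, N = W.shape
    assert M % R == 0 and N % C == 0, "tiles must divide the matrix evenly"
    tm, tn = M // R, N // C

    y = np.zeros(M, dtype=W.dtype)
    for r in range(R):                       # one mesh row of cores
        row_acc = np.zeros(tm, dtype=W.dtype)
        for c in range(C):                   # each core in that row
            tile = W[r*tm:(r+1)*tm, c*tn:(c+1)*tn]
            shard = x[c*tn:(c+1)*tn]         # vector shard held by column c
            row_acc += tile @ shard          # local partial product
        y[r*tm:(r+1)*tm] = row_acc           # row-wise reduction result
    return y

# Quick check against a dense GEMV
W = np.random.randn(8, 8).astype(np.float32)
x = np.random.randn(8).astype(np.float32)
assert np.allclose(mesh_gemv(W, x, (2, 4)), W @ x, atol=1e-5)
```

On real wafer-scale hardware the inner loops run concurrently across cores and the reduction is a communication step over the mesh fabric; the serial loops here only model the data movement.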
MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
MoE-CAP introduces a comprehensive benchmark for evaluating sparse Mixture-of-Experts systems across three key dimensions: Cost, Accuracy, and Performance. The benchmark reveals fundamental trade-offs in MoE deployments and proposes sparsity-aware metrics (S-MBU and S-MFU), along with CAP Radar Diagrams, to help practitioners make informed deployment decisions for large-scale MoE systems (a sketch of why sparsity-aware utilization matters follows this entry).
Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai
PDF · DOI · arXiv
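As a rough illustration of why sparsity-awareness matters: a dense-model utilization metric charges the accelerator for FLOPs over all parameters, while an MoE model only touches the activated experts for each token. The sketch below uses hypothetical model and hardware numbers and the common ~2 FLOPs-per-parameter-per-token estimate to show the direction of the correction; it is an assumption-laden reading of the idea, not the paper's exact S-MFU definition.

```python
def mfu(tokens_per_s: float, params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization with the common ~2 FLOPs per parameter
    per token estimate: achieved FLOP/s divided by hardware peak."""
    return 2.0 * params * tokens_per_s / peak_flops

# Hypothetical MoE deployment: 47B total parameters, ~13B activated per
# token, serving 1,000 tok/s on hardware with a 1e15 FLOP/s peak
# (illustrative numbers only, not results from the paper).
dense_style = mfu(1_000, 47e9, 1e15)      # counts every expert: ~0.094
sparsity_aware = mfu(1_000, 13e9, 1e15)   # counts only active experts: ~0.026
print(f"dense-style MFU:    {dense_style:.3f}")
print(f"sparsity-aware MFU: {sparsity_aware:.3f}")
```

The dense-style figure overstates utilization because it credits the system with FLOPs that the inactive experts never execute; a sparsity-aware metric counts only the compute that actually runs.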