Yeqi Huang
Large Language Models
WaferLLM: Large Language Model Inference at Wafer Scale
WaferLLM introduces the first wafer-scale Large Language Model inference system, achieving up to 200× higher accelerator utilization and 10–20× speedups over GPU clusters. The system builds on a novel PLMR device model and introduces MeshGEMM/MeshGEMV operations optimized for wafer-scale architectures with hundreds of thousands of AI cores (a simplified sketch of the mesh-distributed GEMV idea follows this entry).
Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai
OSDI 2025 · Code · GitHub
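The mesh-distributed GEMV pattern can be illustrated with a toy simulation: tile the weight matrix across a 2-D grid of cores, give each core the matching shard of the input vector, and reduce the partial products along each mesh row. The sketch below is a minimal NumPy illustration of this general idea under those assumptions, not WaferLLM's actual MeshGEMV implementation; the function name `mesh_gemv` and the tiling scheme are illustrative.

```python
import numpy as np

def mesh_gemv(W: np.ndarray, x: np.ndarray, mesh: tuple[int, int]) -> np.ndarray:
    """Simulate a mesh-distributed GEMV, y = W @ x.

    W is tiled across an (R, C) grid of cores; each core holds one tile
    and the matching shard of x, computes a local partial product, and
    the partials are summed along each mesh row (the reduce phase).
    """
    R, C = mesh
    M, N = W.shape
    assert M % R == 0 and N % C == 0, "tiles must divide the matrix evenly"
    tm, tn = M // R, N // C

    y = np.zeros(M, dtype=W.dtype)
    for r in range(R):                       # one mesh row of cores
        row_acc = np.zeros(tm, dtype=W.dtype)
        for c in range(C):                   # each core in that row
            tile = W[r*tm:(r+1)*tm, c*tn:(c+1)*tn]
            shard = x[c*tn:(c+1)*tn]         # vector shard held by column c
            row_acc += tile @ shard          # local partial product
        y[r*tm:(r+1)*tm] = row_acc           # row-wise reduction result
    return y

# Quick check against a dense GEMV
W = np.random.randn(8, 8).astype(np.float32)
x = np.random.randn(8).astype(np.float32)
assert np.allclose(mesh_gemv(W, x, (2, 4)), W @ x, atol=1e-5)
```

On real wafer-scale hardware the inner loops run concurrently across cores and the reduction is a communication step over the mesh fabric; the serial loops here only model the data movement.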
MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
MoE-CAP introduces a comprehensive benchmark for evaluating sparse Mixture-of-Experts systems across three key dimensions: Cost, Accuracy, and Performance. The benchmark reveals fundamental trade-offs in MoE deployments and proposes sparsity-aware metrics (S-MBU and S-MFU), along with CAP Radar Diagrams, to help practitioners make informed deployment decisions for large-scale MoE systems (a sketch of why sparsity-aware utilization matters follows this entry).
Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai
PDF · DOI · arXiv
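As a rough illustration of why sparsity-awareness matters: a dense-model utilization metric charges the accelerator for FLOPs over all parameters, while an MoE model only touches the activated experts for each token. The sketch below uses hypothetical model and hardware numbers and the common ~2 FLOPs-per-parameter-per-token estimate to show the direction of the correction; it is an assumption-laden reading of the idea, not the paper's exact S-MFU definition.

```python
def mfu(tokens_per_s: float, params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization with the common ~2 FLOPs per parameter
    per token estimate: achieved FLOP/s divided by hardware peak."""
    return 2.0 * params * tokens_per_s / peak_flops

# Hypothetical MoE deployment: 47B total parameters, ~13B activated per
# token, serving 1,000 tok/s on hardware with a 1e15 FLOP/s peak
# (illustrative numbers only, not results from the paper).
dense_style = mfu(1_000, 47e9, 1e15)      # counts every expert: ~0.094
sparsity_aware = mfu(1_000, 13e9, 1e15)   # counts only active experts: ~0.026
print(f"dense-style MFU:    {dense_style:.3f}")
print(f"sparsity-aware MFU: {sparsity_aware:.3f}")
```

The dense-style figure overstates utilization because it credits the system with FLOPs that the inactive experts never execute; a sparsity-aware metric counts only the compute that actually runs.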