Title: From Tokens to Topology: System-Level LLM Optimization for Latency, Throughput, and Cost in Production Workflows
Speaker: Yiran Chen
Venue: 42-320
Time: Tuesday, April 7, 2026, 10:00-11:00

Abstract:
Modern LLM deployments are increasingly dominated by system-level bottlenecks rather than model quality alone: long-context prefilling, KV-cache memory traffic, multi-agent redundancy, and workflow-level critical-path latency. This talk presents a practical optimization blueprint that turns these bottlenecks into measurable gains in time-to-first-token (TTFT), end-to-end latency, throughput, and GPU cost, with an emphasis on training-free, plug-and-play methods compatible with production constraints.
We will highlight four complementary leverage points across the serving stack. First, we show how multi-dimensional sparsity can jointly accelerate prefilling and decoding by co-adapting token pruning and neuron sparsity (CoreMatching). Second, for multi-agent workloads where repeated processing of overlapping context creates quadratic prefilling overhead, we introduce online cross-context KV-cache communication (KVCOMM), which reuses shared context despite prefix divergence via prompt-adaptive offsetting, substantially reducing TTFT. Third, we break workflow-level topological barriers via context-level speculation (ACCORDION): downstream agents decode in parallel under partial parent context and verify or roll back on incremental commits, reducing end-to-end latency without degrading task performance. Finally, we discuss how these ideas generalize to emerging non-autoregressive generation such as diffusion LLMs, where "suffix scratchpad" redundancy can be removed using structured suffix dropout (DPad), producing dramatic speedups while preserving accuracy.
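The quadratic prefilling overhead in multi-agent workloads can be made concrete with a toy cost model. The sketch below is purely illustrative (it is not the KVCOMM algorithm, and all function names are hypothetical); it counts tokens prefilled when each agent naively reprocesses a shared context versus when the shared prefix's KV entries are computed once and reused:

```python
# Illustrative cost model: naive per-agent prefilling of a shared context
# vs. reusing the shared prefix's KV-cache entries across agents.
# Hypothetical sketch for intuition only; not the KVCOMM method.

def naive_prefill_cost(shared_tokens: int, unique_tokens: int, num_agents: int) -> int:
    """Each agent re-prefills the full shared context plus its own tokens."""
    return num_agents * (shared_tokens + unique_tokens)

def cached_prefill_cost(shared_tokens: int, unique_tokens: int, num_agents: int) -> int:
    """The shared context is prefilled once; each agent prefills only its suffix."""
    return shared_tokens + num_agents * unique_tokens

if __name__ == "__main__":
    shared, unique, agents = 8000, 500, 6
    print("naive :", naive_prefill_cost(shared, unique, agents))   # 51000 tokens
    print("cached:", cached_prefill_cost(shared, unique, agents))  # 11000 tokens
```

As the number of agents grows, the naive cost scales with the product of agent count and total context length, while the cached cost pays for the shared context only once; the gap is what cross-context KV-cache reuse targets.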
Biography:
Yiran Chen is the John Cocke Distinguished Professor of Electrical and Computer Engineering at Duke University. He serves as the Principal Investigator and Director of the NSF AI Institute for Edge Computing Leveraging Next Generation Networks (Athena) and Co-Director of the Duke Center for Computational Evolutionary Intelligence (DCEI). His research group focuses on innovations in emerging memory and storage systems, machine learning and neuromorphic computing, and edge computing. Dr. Chen has authored over 700 publications and holds 96 U.S. patents. His work has received widespread recognition, including two Test-of-Time Awards and 14 Best Paper/Poster Awards. He is the recipient of the IEEE Circuits and Systems Society’s Charles A. Desoer Technical Achievement Award and the IEEE Computer Society’s Edward J. McCluskey Technical Achievement Award. He also serves as the inaugural Editor-in-Chief of the IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI) and the founding Chair of the IEEE Circuits and Systems Society’s Machine Learning Circuits and Systems (MLCAS) Technical Committee. Dr. Chen is a Fellow of the AAAS, ACM, IEEE, and NAI, and a member of the European Academy of Sciences and Arts.