AIME: Adaptive Inference with Model Evolution for Efficient On-Device Large Language Model Serving

Published in IEEE ICDCS, 2025

In this work, we propose AIME, an adaptive model customization framework designed to deploy large Transformer-based models efficiently across heterogeneous devices in distributed systems. AIME addresses critical issues, such as performance imbalance, energy inefficiency, and privacy risk, that arise when deploying pre-trained models like ViT and BERT at the edge.

The system uses a bidirectional single-loop architecture that progressively customizes models in two phases: (1) backbone customization through Pareto-optimal architecture generation on cloud and edge servers, and (2) header refinement through neural architecture search (NAS) and personalized aggregation based on local data distributions.
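To illustrate the first phase, the sketch below shows one way Pareto-optimal backbone candidates could be filtered by latency and accuracy. The candidate names, metric values, and the `pareto_front` helper are all illustrative assumptions, not the paper's actual generation procedure, which is more involved.

```python
def pareto_front(candidates):
    """Return the candidates not dominated by any other.

    A candidate dominates another if it is no slower and no less
    accurate, and strictly better on at least one of the two metrics.
    Each candidate is a (name, latency_ms, accuracy) tuple.
    """
    front = []
    for name, lat, acc in candidates:
        dominated = any(
            (l2 <= lat and a2 >= acc) and (l2 < lat or a2 > acc)
            for _, l2, a2 in candidates
        )
        if not dominated:
            front.append((name, lat, acc))
    return front

# Hypothetical backbone variants: (name, latency_ms, accuracy)
backbones = [
    ("vit-tiny", 12.0, 0.71),
    ("vit-small", 25.0, 0.78),
    ("vit-base", 60.0, 0.81),
    ("vit-small-pruned", 30.0, 0.76),  # dominated by vit-small
]
print(pareto_front(backbones))
```

Under this toy metric set, `vit-small-pruned` is dropped because `vit-small` is both faster and more accurate; the remaining three variants trade latency for accuracy and so all sit on the front.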

This work is a collaboration among researchers at the College of Intelligence and Computing, Tianjin University.

Recommended citation: Ziming Dai, Yunfeng Zhao, Yuxuan Wang, Jinhui Xu, Jinhang Song, Chao Qiu, and Salman Avestimehr. "AIME: Adaptive Inference with Model Evolution for Efficient On-Device Large Language Model Serving." IEEE ICDCS 2025.