UDC 004.048
DOI: 10.36871/2618-9976.2025.12–2.012
Authors
Sergey Alexandrovich Yarushev,
Candidate of Technical Sciences, Director of the Center for Advanced Research in Artificial Intelligence, Plekhanov Russian University of Economics
Ivan Nikolaevich Petrov,
Postgraduate Student, Department of Informatics, Plekhanov Russian University of Economics
Abstract
The article offers a systematic review of engineering approaches to designing high-load inference systems for large language models (LLMs). It is shown that the main bottlenecks lie at the level of memory and compute scheduling: the two-phase nature of generation (prefill/decode), KV-cache storage, and GPU load imbalance under mixed request workloads. The algorithms and frameworks that have become industry standards are examined: PagedAttention (vLLM) for paged KV-cache management, FlashAttention/FlashAttention-2 for IO-efficient attention computation, continuous batching and chunked prefill strategies, prioritization and preemption mechanisms, decoding acceleration via speculative decoding, and quantization (SmoothQuant, AWQ). Based on this analysis, practical recommendations are given for combining techniques (quantization + speculative decoding + continuous batching) to obtain a cumulative speedup without changing the model architecture. The material is addressed to developers and architects of LLM services who aim to increase throughput, reduce latency, and optimize inference cost under long contexts and high concurrency.
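As a minimal illustration of the recommended combination of techniques, the sketch below assumes the vLLM Python API and an AWQ-quantized checkpoint; the model name, memory settings, and sampling parameters are placeholders rather than configurations evaluated in the article, and speculative decoding is only noted in a comment because its engine arguments differ across vLLM releases.

```python
# Illustrative sketch only: serving an AWQ-quantized model with vLLM, whose
# scheduler provides continuous batching over a paged (PagedAttention) KV cache.
from vllm import LLM, SamplingParams

# AWQ weights shrink the memory footprint of the model itself, leaving more
# GPU memory for the KV cache and therefore for larger concurrent batches.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder checkpoint
    quantization="awq",
    gpu_memory_utilization=0.90,  # share of GPU memory the engine may claim
    max_model_len=8192,           # cap context length to bound KV-cache size
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

# Submitting a batch of prompts is enough to exercise continuous batching:
# the scheduler interleaves prefill and decode steps of different requests.
prompts = [
    "Explain what the KV cache stores during autoregressive decoding.",
    "Why does chunked prefill reduce tail latency for short requests?",
    "Summarize the idea behind speculative decoding in two sentences.",
]

for out in llm.generate(prompts, sampling):
    print(out.prompt, "->", out.outputs[0].text.strip())

# Speculative decoding can be layered on top through vLLM's engine arguments;
# the exact parameter names vary between vLLM releases and are omitted here.
```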
Keywords
LLM inference
KV cache
PagedAttention
FlashAttention
continuous batching
chunked prefill
speculative decoding
quantization (AWQ, SmoothQuant)
vLLM
TensorRT-LLM
throughput
latency
References
[1] Agrawal A., Kedia N., Panwar A. et al. (2025) Efficient LLM Inference via Chunked Prefills. ACM SIGOPS Operating Systems Review, vol. 59, no. 2, pp. 9–16. DOI: 10.1145/3759441.3759444.
[2] Daniel C., Shen C., Liang E. et al. (2023) Continuous Batching: How to Increase LLM Inference Throughput by 23x. Anyscale. URL: https://www.anyscale.com/blog/continuous-batching-llm-inference.
[3] Chen C., Borgeaud S., Irving G. et al. (2023) Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318 [cs.CL]. URL: https://arxiv.org/abs/2302.01318.
[4] Chitty-Venkata K. T., Emani M., Vishwanath V. et al. (2024) LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators. arXiv:2411.00136 [cs.LG]. URL: https://arxiv.org/abs/2411.00136.
[5] Dao-AILab/flash-attention: Fast and memory-efficient exact attention (2022–2024). URL: https://github.com/Dao-AILab/flash-attention.
[6] Dao T. (2023) FlashAttention‑2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691 [cs.LG]. URL: https://arxiv.org/abs/2307.08691.
[7] Dao T., Fu D. Y. et al. (2022) FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG]. URL: https://arxiv.org/abs/2205.14135.
[8] Dubey A. et al. (2024) The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI]. URL: https://arxiv.org/abs/2407.21783.
[9] Gholami A., Kim S., Dong Z. et al. (2022) A Survey of Quantization Methods for Efficient Neural Network Inference. URL: https://arxiv.org/abs/2103.13630.
[10] Gupta P., Yi J., Kiely P. (2025) How we built production-ready speculative decoding with TensorRT-LLM. Baseten. URL: https://www.baseten.co/blog/how-we-built-production-ready-speculative-decoding-with-tensorrt-llm/.
[11] Jiang A. Q., Sablayrolles A., Mensch A. et al. (2023) Mistral 7B. arXiv:2310.06825 [cs.CL]. URL: https://arxiv.org/abs/2310.06825.
[12] Kleppmann M. (2017) Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media. URL: https://unidel.edu.ng/focelibrary/books/Designing%20Data-Intensive%20Applications%20The%20Big%20Ideas%20Behind%20Reliable,%20Scalable,%20and%20Maintainable%20Systems%20by%20Martin%20Kleppmann%20(z-lib.org).pdf.
[13] KV cache offloading. LLM Inference Handbook. URL: https://bentoml.com/llm/inference-optimization/kv-cache-offloading.
[14] Kwon W., Li Z., Zhuang S. et al. (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180 [cs.DC]. URL: https://arxiv.org/abs/2309.06180.
[15] Leviathan Y., Kalman M., Matias Y. (2023) Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192 [cs.CL]. URL: https://arxiv.org/abs/2211.17192.
[16] Li B., Jiang Y., Gadepally V. et al. (2024) LLM Inference Serving: Survey of Recent Advances and Opportunities. arXiv:2407.12391 [cs.CL]. URL: https://arxiv.org/abs/2407.12391.
[17] Lin J., Tang J., Tang H. et al. (2024) AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978 [cs.LG]. URL: https://arxiv.org/abs/2306.00978.
[18] LLM Inference Performance Engineering: Best Practices (2023). Databricks. URL: https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices.
[19] NVIDIA Corporation (2023) TensorRT-LLM: Optimizing Inference on Large Language Models. URL: https://github.com/NVIDIA/TensorRT-LLM.
[20] Peng H., Wu K., Wei Y. et al. (2023) FP8-LM: Training FP8 Large Language Models. arXiv:2310.18313 [cs.LG]. URL: https://arxiv.org/abs/2310.18313.
[21] Pope R., Douglas S., Chowdhery A. et al. (2023) Efficiently Scaling Transformer Inference. arXiv:2211.05102 [cs.LG]. URL: https://arxiv.org/abs/2211.05102.
[22] Sevegnani K., Fiameni G., Uppal U. et al. (2025) Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training. NVIDIA Technical Blog. URL: https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/.
[23] Sheng Y., Zheng L., Yuan B. et al. (2023) FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. arXiv:2303.06865 [cs.LG]. URL: https://arxiv.org/abs/2303.06865.
[24] Vaidya N., Comly N., DeLaere J. et al. (2023) NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs. NVIDIA Technical Blog. URL: https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/.
[25] Vaswani A., Shazeer N. et al. (2017) Attention Is All You Need. arXiv:1706.03762 [cs.CL]. URL: https://arxiv.org/abs/1706.03762.
[26] Verma S., Vaidya N. (2023) Mastering LLM Techniques: Inference Optimization. NVIDIA Developer Blog. URL: https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/.
[27] vLLM. URL: https://docs.vllm.ai/.
[28] vLLM Team (2023) vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention. URL: https://github.com/vllm-project/vllm.
[29] Welcome to TensorRT-LLM's Documentation! TensorRT-LLM. URL: https://nvidia.github.io/TensorRT-LLM/.
[30] Xia H., Yang Z., Dong Q. et al. (2024) Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. arXiv:2401.07851 [cs.CL]. URL: https://arxiv.org/abs/2401.07851.
[31] Xiao G., Lin J., Seznec M. et al. (2022) SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv:2211.10438 [cs.CL]. URL: https://arxiv.org/abs/2211.10438.
[32] Yu G., Jeong J. S., Kim G. et al. (2022) Orca: A Distributed Serving System for Transformer-Based Generative Models. Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2022). URL: https://www.usenix.org/conference/osdi22/presentation/yu.
[33] Zhou Z., Ning X., Hong K. et al. (2024) A Survey on Efficient Inference for Large Language Models. arXiv:2404.14294 [cs.CL]. URL: https://arxiv.org/abs/2404.14294.

