{{lowercase title}}
{{Short description|Open-source software for large language model inference}}
{{Use mdy dates|date=April 2026}}
{{Infobox software
| name = vLLM
| logo = vLLM.svg
| author = Sky Computing Lab ([[University of California, Berkeley]])
| developer = vLLM contributors
| released = 2023
| programming language = [[Python (programming language)|Python]], [[CUDA]], [[C++]]
| genre = [[Large language model]] [[inference engine]]
| license = [[Apache License 2.0]]
| website = {{URL|https://vllm.ai}}
| repo = {{URL|https://github.com/vllm-project/vllm}}
}}
'''vLLM''' is an open-source software framework for inference and serving of [[large language model]]s and related [[multimodal model]]s. Originally developed at the [[University of California, Berkeley]]'s Sky Computing Lab, the project is centered on ''PagedAttention'', a [[memory management|memory-management]] method for [[Transformer (deep learning)|transformer]] [[Transformer (deep learning)#KV caching|key–value cache]]s, and supports features such as continuous batching, [[distributed computing|distributed]] inference, [[Large language model#Quantization|quantization]], and [[OpenAI]]-compatible [[application programming interface|APIs]].<ref>{{cite web |title=GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs |url=https://github.com/vllm-project/vllm |website=GitHub |publisher=GitHub, Inc. |access-date=April 22, 2026}}</ref><ref>{{cite conference |last1=Kwon |first1=Woosuk |last2=Li |first2=Zhuohan |last3=Zhuang |first3=Siyuan |last4=Sheng |first4=Ying |last5=Zheng |first5=Lianmin |last6=Yu |first6=Cody Hao |last7=Gonzalez |first7=Joseph E. |last8=Zhang |first8=Hao |last9=Stoica |first9=Ion |title=Efficient Memory Management for Large Language Model Serving with PagedAttention |conference=Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles |year=2023 |url=https://arxiv.org/abs/2309.06180 |access-date=April 22, 2026}}</ref><ref>{{cite web |title=vLLM |url=https://pytorch.org/projects/vllm/ |website=PyTorch |publisher=PyTorch Foundation |access-date=April 22, 2026}}</ref> According to a project [[software maintainer|maintainer]], the "v" in vLLM originally referred to "virtual", inspired by [[virtual memory]].<ref>{{cite web |title=vLLM full name |url=https://github.com/vllm-project/vllm/issues/835 |website=GitHub |publisher=GitHub, Inc. |date=August 23, 2023 |access-date=April 22, 2026}}</ref>
== History ==
vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley. Its core ideas were described in the 2023 paper ''Efficient Memory Management for Large Language Model Serving with PagedAttention'',<ref>{{cite arXiv |last1=Kwon |first1=Woosuk |last2=Li |first2=Zhuohan |last3=Zhuang |first3=Siyuan |last4=Sheng |first4=Ying |last5=Zheng |first5=Lianmin |last6=Yu |first6=Cody Hao |last7=Gonzalez |first7=Joseph E. |last8=Zhang |first8=Hao |last9=Stoica |first9=Ion |eprint=2309.06180 |title=Efficient Memory Management for Large Language Model Serving with PagedAttention |class=cs.LG |date=September 12, 2023}}</ref> which presented the system as a [[High-throughput computing|high-throughput]] and [[Memory (computer)|memory]]-efficient serving engine for [[large language model]]s.
In 2025, the [[PyTorch]] Foundation announced that vLLM had become a Foundation-hosted project. PyTorch's project page states that the [[University of California, Berkeley]] contributed vLLM to the [[Linux Foundation]] in July 2024.<ref>{{cite web |title=PyTorch Foundation Welcomes vLLM as a Hosted Project |url=https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/ |website=PyTorch |publisher=PyTorch Foundation |date=May 7, 2025 |access-date=April 22, 2026}}</ref>
In January 2026, ''[[TechCrunch]]'' reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding.<ref>{{cite web |last=Temkin |first=Marina |title=Inference startup Inferact lands $150M to commercialize vLLM |url=https://techcrunch.com/2026/01/22/inference-startup-inferact-lands-150m-to-commercialize-vllm/ |website=TechCrunch |date=January 22, 2026 |access-date=April 22, 2026}}</ref>
== Architecture ==
According to its 2023 paper, vLLM was designed to improve the efficiency of [[large language model]] serving by reducing memory waste in the [[Transformer (deep learning)#KV caching|key–value cache]] used during [[Transformer (deep learning)|transformer]] inference. The paper introduced ''PagedAttention'', an algorithm inspired by [[virtual memory]] and [[paging]] techniques in [[operating system]]s, and described vLLM as using block-level memory management and request scheduling to increase [[throughput]] while maintaining similar [[Latency (engineering)|latency]].
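The block-level allocation idea can be illustrated with a simplified sketch (hypothetical names and a toy block size; not vLLM's actual code): each request's cached tokens are stored in fixed-size blocks tracked by a per-request block table, so memory is reserved incrementally rather than as one contiguous maximum-length buffer.

```python
# Illustrative sketch of PagedAttention-style block-table allocation
# for a transformer KV cache (simplified; not vLLM's implementation).

BLOCK_SIZE = 4  # tokens per KV-cache block (a real system might use 16)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of cached tokens

    def append_token(self, req_id):
        """Reserve cache space for one more token of a request."""
        table = self.block_tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full: map a new one
            if not self.free_blocks:
                raise MemoryError("cache exhausted; request must be preempted")
            table.append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1

    def free(self, req_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):  # a 6-token request occupies ceil(6/4) = 2 blocks
    cache.append_token("req-0")
print(len(cache.block_tables["req-0"]))  # 2
```

In this scheme, at most one partially filled block per request is wasted, in contrast to reserving a full maximum-sequence-length buffer up front.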
The project documentation and repository describe support for continuous batching, chunked prefill, [[speculative decoding]], prefix caching, [[Large language model#Quantization|quantization]], and multiple forms of [[distributed computing|distributed]] inference and serving. [[PyTorch]] has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including [[Nvidia|NVIDIA]] and [[Advanced Micro Devices|AMD]] [[graphics processing unit|GPUs]], [[Tensor Processing Unit|Google TPUs]], [[AWS]] Trainium, and [[Intel]] processors.
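Continuous batching can likewise be illustrated with a simplified sketch (hypothetical function names; not vLLM's scheduler): rather than waiting for an entire batch to finish, the server admits waiting requests into freed batch slots between decoding steps, so the batch stays full at every iteration.

```python
# Illustrative sketch of continuous (iteration-level) batching
# (simplified; not vLLM's actual scheduler).
from collections import deque

def continuous_batching(requests, max_batch=2):
    """requests: list of (request_id, tokens_to_generate) pairs."""
    waiting = deque(requests)
    running = {}          # request id -> tokens still to generate
    finished_order = []
    while waiting or running:
        # admit new requests into any free batch slots
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # one decoding step: every running request emits one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:  # finished: its slot is freed immediately
                del running[rid]
                finished_order.append(rid)
    return finished_order

# "b" (1 token) finishes first and frees its slot for "c"
# while "a" is still decoding
print(continuous_batching([("a", 3), ("b", 1), ("c", 2)]))  # ['b', 'a', 'c']
```

With static batching, "c" could not start until both "a" and "b" had finished; iteration-level scheduling removes that idle time.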
== See also ==
* [[SGLang]]
* [[llama.cpp]]
* [[OpenVINO]]
* [[Open Neural Network Exchange]]
* [[Comparison of deep learning software]]
* [[Comparison of machine learning software]]
* [[Lists of open-source artificial intelligence software]]
== External links ==
* [https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=26.03.post1-py3 vLLM on NVIDIA NGC]
* [https://pytorch.org/projects/vllm/ vLLM project page at PyTorch]
== References ==
{{reflist}}