# @vllm_project vLLM

vLLM posts on X most often about inference, llm, ai, and native. They currently have [------] followers and [---] posts still getting attention, totaling [-------] engagements in the last [--] hours.

### Engagements: [-------] [#](/creator/twitter::1774187564276289536/interactions)

### Mentions: [--] [#](/creator/twitter::1774187564276289536/posts_active)

### Followers: [------] [#](/creator/twitter::1774187564276289536/followers)

### CreatorRank: [-------] [#](/creator/twitter::1774187564276289536/influencer_rank)

### Social Influence

**Social category influence** [technology brands](/list/technology-brands) [stocks](/list/stocks) [finance](/list/finance) [social networks](/list/social-networks) [countries](/list/countries) [travel destinations](/list/travel-destinations) [gaming](/list/gaming) [exchanges](/list/exchanges)

**Social topic influence** [inference](/topic/inference) #38, [llm](/topic/llm) #744, [ai](/topic/ai), [native](/topic/native) #404, [agentic](/topic/agentic), [strong](/topic/strong), [the first](/topic/the-first), [community](/topic/community), [we are](/topic/we-are), [pip](/topic/pip)

**Top accounts mentioned or mentioned by** [@vllmproject](/creator/undefined) [@nvidia](/creator/undefined) [@deepseekai](/creator/undefined) [@kimimoonshot](/creator/undefined) [@huggingface](/creator/undefined) [@alibabaqwen](/creator/undefined) [@pytorch](/creator/undefined) [@nvidiaaidev](/creator/undefined) [@minimaxai](/creator/undefined) [@mistralai](/creator/undefined) [@redhatai](/creator/undefined) [@amd](/creator/undefined) [@ai_bridge_japan](/creator/undefined) [@zaiorg](/creator/undefined) [@aiatmeta](/creator/undefined) [@sparkycollier](/creator/undefined) [@aiatamd](/creator/undefined) [@intel](/creator/undefined) [@openai](/creator/undefined) [@redhat](/creator/undefined)

**Top assets mentioned** [IBM (IBM)](/topic/ibm) [Alphabet Inc Class A (GOOGL)](/topic/$googl) [Uber Technologies, Inc. (UBER)](/topic/uber)

### Top Social Posts

Top posts by engagements in the last [--] hours

"One of the best parts of vLLM is our amazing community. Our issues and pull requests are ever increasing, and we have members of the community helping close stale issues. @hmellor_ helped close more than [---] stale issues over the last month" [X Link](https://x.com/vllm_project/status/1777408164943782123) 2024-04-08T18:48Z [---] followers, [---] engagements

"Speculative decoding improves end-to-end latency by using draft models or n-grams to propose multiple tokens before the large model verifies them. https://docs.vllm.ai/en/stable/models/spec_decode.html" [X Link](https://x.com/vllm_project/status/1801417095777030400) 2024-06-14T00:51Z [----] followers, [---] engagements

"Multimodal is also getting more important. vLLM now has an OpenAI-compatible API for vision models. While only LLaVA and LLaVA-NeXT are supported today, there are 5+ in-flight PRs adding new models. https://docs.vllm.ai/en/stable/models/vlm.html" [X Link](https://x.com/vllm_project/status/1801417099681874173) 2024-06-14T00:51Z [----] followers, [---] engagements

"Join us on Monday September [--] at Fort Mason to discuss recent performance enhancements in vLLM. This is a collaboration with the NVIDIA Triton team.
https://lu.ma/87q3nvnh" [X Link](https://x.com/vllm_project/status/1821615579398361597) 2024-08-08T18:32Z [----] followers, [----] engagements

"Beyond the dedicated vLLM track at #RaySummit, we are co-hosting a meet-and-greet with @AMD and @anyscalecompute on the training day, Sept 30th, to kick off the summit. This will be the first event where you can learn about AMD MI300X's performance on vLLM https://lu.ma/db5ld9n5" [X Link](https://x.com/vllm_project/status/1835756291803004933) 2024-09-16T19:03Z [----] followers, [----] engagements

"@Roblox not only adopts vLLM to innovate AI-enabled products but also actively contributes to the open source ecosystem. "We have adopted vLLM as our primary inference engine for LLMs, leveraging vLLM's high-performance capabilities to power AI applications across Roblox. Since moving to vLLM we've seen an almost 2x improvement in both latency and throughput, and we currently serve approximately [--] billion tokens per week." "Our choice of vLLM aligns with our commitment to leveraging open-source and cutting-edge technologies that can scale efficiently to meet the demands of our vast user base and" [X Link](https://x.com/vllm_project/status/1836213865744580981) 2024-09-18T01:21Z [----] followers, [----] engagements

"Join the vLLM team at #RaySummit (Oct [--] 2) for the first ever "The State of vLLM 2024" talk, recapping our progress over the last year and looking forward to the future, by @zhuohan123 and @KuntaiDu. This talk will kick-start the track of vLLM talks from @anyscalecompute @Roblox @neuralmagic @IBMResearch @Apple @Uber @intel @databricks and others. Registration link: (use code AnyscalevLLM15 for 15% off) http://raysummit.anyscale.com" [X Link](https://x.com/vllm_project/status/1837234559680979439) 2024-09-20T20:57Z [----] followers, [----] engagements

"We are utilizing PyTorch as a narrow waist for hardware abstractions" [X Link](https://x.com/vllm_project/status/1837258745287692569) 2024-09-20T22:33Z [----] followers, [---] engagements

"Currently both AMD GPU and Google TPU utilize PyTorch's native ops and custom backends" [X Link](https://x.com/vllm_project/status/1837258748089496006) 2024-09-20T22:33Z [----] followers, [---] engagements

"vLLM contributor @KaichaoYou will speak at IBM TechXchange on Oct [--] in Las Vegas; come and discuss anything about vLLM" [X Link](https://x.com/vllm_project/status/1838775761580503236) 2024-09-25T03:01Z [----] followers, [----] engagements

"We just created a vLLM developer Slack workspace to discuss features, coordinate collaborations, and bring the user community together. Looking forward to seeing you there http://slack.vllm.ai" [X Link](https://x.com/vllm_project/status/1843446798742106611) 2024-10-08T00:22Z [----] followers, [----] engagements

"Check out this blog from @awscloud about scaling Rufus, which is powered by @vllm_project on Inferentia and Trainium, serving [--] million tokens a minute. "Within each container an NVIDIA Triton Inference Server with a Python backend is used, running vLLM with the Neuron SDK. vLLM is a memory-efficient inference and serving engine that is optimized for high throughput."
"These choices allowed Rufus to scale up over [-----] Trainium and Inferentia chips across three Regions serving an average of [--] million tokens a minute while maintaining P99 less than [--] second latency to the first response for" [X Link](https://x.com/vllm_project/status/1844457057535263088) 2024-10-10T19:16Z [----] followers, [----] engagements "๐The amazing folks at @EmbeddedLLM and @HotAisle have performed extensive tuning and benchmarking of vLLM on @AMD's MI300X GPU showcase leading performance. With [---] GB HBM3 per GPU MI300X is a great choice for large models such as 405B. https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html" [X Link](https://x.com/vllm_project/status/1851346337595605491) 2024-10-29T19:32Z 18.7K followers, [----] engagements "We reached 30K GitHub stars For all the stargazers what do you want to see in vLLM in [----] ๐" [X Link](https://x.com/vllm_project/status/1857253839877144701) 2024-11-15T02:46Z [----] followers, [----] engagements "what would you like vLLM to build/fix in [----] :D" [X Link](https://x.com/vllm_project/status/1872050613196141011) 2024-12-25T22:43Z [----] followers, [----] engagements "The first vLLM meetup in [----] is happening in two weeks on January 22nd Wednesday with @googlecloud in SF We will talk about vLLM's performant V1 architecture Q1 roadmap Google Cloud's innovation around vLLM: networking Cloud Run Vertex and TPU https://lu.ma/zep56hui https://lu.ma/zep56hui https://lu.ma/zep56hui https://lu.ma/zep56hui" [X Link](https://x.com/vllm_project/status/1877085706621010419) 2025-01-08T20:11Z [----] followers, [----] engagements "We focus our benchmark on long generation workload. On 8xH200 we are seeing 40% increase in generation throughput from the optimized fp8 kernels and 3.4x enhancement from MLA. On TP8PP2 settings with H100 we see 26% increase from the fp8 kernels and 2.8x increase from MLA" [X Link](https://x.com/vllm_project/status/1885837176635797993) 2025-02-01T23:46Z [----] followers, [----] engagements "The throughput increase comes from memory savings. MLA offers 9.6x memory capacity for KV caches which increases the batch size. For example on 8xH200 we increased from [-----] tokens to [------] tokens which means batch size of [--] to 128" [X Link](https://x.com/vllm_project/status/1885837178854617592) 2025-02-01T23:46Z [----] followers, [----] engagements "@zjasper666 @deepseek_ai Here's the batch size [--] performance: https://x.com/vllm_project/status/1885837180960211349 However do note that MHA is faster for generation than MLA actually in low qps settings. This is a current limitation we are addressing. Take the two settings we have under a single request we see MHA is worse time-to-first-token (TTFT) but higher time-per-output-token (TPOT) https://t.co/LmZjLih1KJ https://x.com/vllm_project/status/1885837180960211349 However do note that MHA is faster for generation than MLA actually in low qps settings. This is a current limitation we are" [X Link](https://x.com/vllm_project/status/1885927169668640864) 2025-02-02T05:44Z [----] followers, [---] engagements "We are welcoming AIBrix to vLLM organization It is a battery-included vLLM Kubernetes serving stack developed by ByteDance. 
https://blog.vllm.ai/2025/02/21/aibrix-release.html" [X Link](https://x.com/vllm_project/status/1893087260519588127) 2025-02-21T23:55Z [----] followers, [----] engagements

"Join us at the SF AIBrix & vLLM Meetup on June 18th at the AWS SF GenAI Loft. Learn from experts at ByteDance, AWS Neuron, and EKS. Discover AIBrix: a scalable, cost-effective control plane for vLLM. Talks, Q&A, pizza, and networking https://lu.ma/ab2id296" [X Link](https://x.com/vllm_project/status/1929952542185886184) 2025-06-03T17:25Z 14.2K followers, [----] engagements

"Cool to see @vllm_project used as part of @WhatsApp Trusted Execution Environment (TEE) Private Processing. Paper here: https://ai.meta.com/static-resource/private-processing-technical-whitepaper" [X Link](https://x.com/vllm_project/status/1933684411180229048) 2025-06-14T00:34Z 14.2K followers, [----] engagements

"MiniMax M1 is one of the SOTA open-weight models from @MiniMax__AI. Check out how it is efficiently implemented in vLLM, directly from the team https://blog.vllm.ai/2025/06/30/minimax-m1.html Another strong open model with Apache [---] license, this one from @MiniMax_AI - places in the top [--]. MiniMax-M1 is now live on the Text Arena leaderboard, landing at #12. This puts it at equal ranking with Deepseek V3/R1 and Qwen [--] See thread to learn more about its https://t.co/j14gU5D37O" [X Link](https://x.com/vllm_project/status/1939901253762555976) 2025-07-01T04:17Z 14.6K followers, 19K engagements

"We started a channel in the vLLM Slack (join via http://slack.vllm.ai) and let's discuss tips and ways to address this" [X Link](https://x.com/vllm_project/status/1941973032031191500) 2025-07-06T21:30Z 14.6K followers, [----] engagements

"Intern-S1 is supported in vLLM now, thanks to the joint efforts between the vLLM team and the InternLM team @intern_lm. The easy way: uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly then vllm serve internlm/Intern-S1 --tensor-parallel-size [--] --trust-remote-code Introducing Intern-S1, our most advanced open-source multimodal reasoning model yet. Strong general-task capabilities + SOTA performance on scientific tasks, rivaling leading closed-source commercial models. Built upon a 235B MoE language model and a 6B Vision encoder. https://t.co/0htivKiv3k" [X Link](https://x.com/vllm_project/status/1949147022977843459) 2025-07-26T16:37Z 15.4K followers, [----] engagements

"@QuixiAI @Kimi_Moonshot @huggingface @casper_hansen_ What's the problem you get on vLLM? Please file an issue to discuss" [X Link](https://x.com/vllm_project/status/1952767200080855298) 2025-08-05T16:22Z 16K followers, [---] engagements

"Fun demo of gpt-oss on @vllm_project Want to see our open models in action? Watch how gpt-oss builds a video game using tools step-by-step within chain-of-thought reasoning https://t.co/WNeV0cpwM2" [X Link](https://x.com/vllm_project/status/1952836939243299186) 2025-08-05T20:59Z 16K followers, [----] engagements

"@code_star The openai-harmony=0.1.0 should be hosted as part of the wheel index this morning, and it is a pre-release wheel. About an hour ago we also updated the wheel to use any version of openai-harmony available. https://wheels.vllm.ai/gpt-oss" [X Link](https://x.com/vllm_project/status/1952869429211218313) 2025-08-05T23:08Z 16.1K followers, [---] engagements
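Several posts above install vLLM from an extra wheel index and then launch an OpenAI-compatible server with `vllm serve`. A minimal sketch of that pattern, assuming the nightly index quoted in the Intern-S1 post; the tensor-parallel size here is a placeholder, not a recommendation:

```bash
# Install vLLM, pulling wheels from the extra index mentioned in the posts above.
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly

# Serve a model with tensor parallelism; 8 is a placeholder, match it to your GPU count.
vllm serve internlm/Intern-S1 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```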
"@casper_hansen_ maybe you will be interested in the verl team; @verl_project already builds their RL pipeline on this token-in-token-out usage of vLLM. https://verl.readthedocs.io/en/latest/advance/agent_loop.html" [X Link](https://x.com/vllm_project/status/1954492361758957717) 2025-08-10T10:37Z 16.2K followers, [----] engagements

"Join us for the vLLM Meetup in Singapore on Aug [--] [----] (Wednesday) 6-8:30 PM @SGInnovate. We will discuss efficient LLM inference with talks on: - @EmbeddedLLM on the latest vLLM advancements - @AMD on optimizing inference on data center GPUs - @WekaIO presenting vLLM + LMCache + SSD for high-performance KV cache - @ASTARsg MERaLiON team on deploying AudioLLM with vLLM+Ray for autoscaling and load balancing. Followed by Q&A and networking. Spaces limited. RSVP at https://www.sginnovate.com/event/vllm-sg-meet" [X Link](https://x.com/vllm_project/status/1954768340242796714) 2025-08-11T04:54Z 16.6K followers, [----] engagements

"When you chat with Rufus on the Amazon app, it is powered by vLLM https://aws.amazon.com/blogs/machine-learning/how-amazon-scaled-rufus-by-building-multi-node-inference-using-aws-trainium-chips-and-vllm/" [X Link](https://x.com/vllm_project/status/1956116150259212619) 2025-08-14T22:10Z 16.6K followers, [----] engagements

"We are co-hosting a gpt-oss meetup in San Francisco next Wednesday 08/27 at 5:30pm together with @OpenAI and @ollama at the @ycombinator office. RSVP here https://lu.ma/gpt-oss" [X Link](https://x.com/vllm_project/status/1958758804008706101) 2025-08-22T05:11Z 17.2K followers, [----] engagements

"@LinkedIn not only uses vLLM at massive scale but also actively contributes to the community; check out their wonderful blog: https://www.linkedin.com/blog/engineering/ai/how-we-leveraged-vllm-to-power-our-genai-applications" [X Link](https://x.com/vllm_project/status/1960433919440146821) 2025-08-26T20:07Z 16.8K followers, [--] engagements

"@casper_hansen_ @sgl_project There's a guide from the rednote team, and we will work with the team on upstreaming https://huggingface.co/rednote-hilab/dots.ocr#vllm-inference" [X Link](https://x.com/vllm_project/status/1965836724099232178) 2025-09-10T17:56Z 17.6K followers, [---] engagements

"@thinkymachines is building a vLLM team. You will be working with @woosuk_k and many great folks to build the world's frontier inference engine. At Thinking Machines our work includes collaborating with the broader research community. Today we are excited to share that we are building a vLLM team at @thinkymachines to advance open-source vLLM and serve frontier models. If you are interested please DM me or @barret_zoph"
[X Link](https://x.com/vllm_project/status/1966255318050156704) 2025-09-11T21:39Z 17.8K followers, [----] engagements

"Step [--]: Top-k Selection. For each query, the indexer selects its top-2048 tokens in the context to attend to. (If context length [----], all tokens are included.)" [X Link](https://x.com/vllm_project/status/1972617281109925897) 2025-09-29T10:59Z 18.9K followers, [----] engagements

"Step [--]: Sparse MLA. With the selected token indices, the FlashMLA sparse kernel performs efficient attention, skipping irrelevant positions" [X Link](https://x.com/vllm_project/status/1972617283693604965) 2025-09-29T10:59Z 18.9K followers, [----] engagements

"Thank you @NVIDIADC for supporting vLLM for the @deepseek_ai model launch. Blackwell is now the go-to release platform for new MoEs, and we cannot do it without the amazing team from NVIDIA. We partnered with @vllm_project to optimize DeepSeek-V3.2-Exp across our platform. @deepseek_ai's Sparse Attention uses a lightning indexer to selectively attend to the most relevant 2K tokens, enabling higher performance for long context use cases. vLLM the open source" [X Link](https://x.com/vllm_project/status/1973238773090754735) 2025-10-01T04:09Z 19.6K followers, [----] engagements

"@CatGodSandHive check out their paper for more details" [X Link](https://x.com/vllm_project/status/1974735221494104463) 2025-10-05T07:15Z 19K followers, [---] engagements

"@ScentScai @OpenAI @nvidia @huggingface @lmstudio @devpost congrats" [X Link](https://x.com/vllm_project/status/1977914122105737288) 2025-10-14T01:47Z 19.6K followers, [---] engagements

"Announcing the completely reimagined vLLM TPU. In collaboration with @Google, we've launched a new high-performance TPU backend unifying PyTorch and JAX under a single lowering path for amazing performance and flexibility" [X Link](https://x.com/vllm_project/status/1978852171522408614) 2025-10-16T15:54Z 19.6K followers, [--] engagements

"We hear your voice. MiniMax in particular forces the lm head to be fp32, which restores accuracy but takes a lot of memory. We are experimenting to see if dynamically casting fp16/bf16 to fp32 in the kernel helps the accuracy of the logits. https://github.com/vllm-project/vllm/pull/19592 horrifying bug of the day is finding out that vLLM and Hugging Face produce significantly different logprobs https://t.co/PWgVCMCcgh" [X Link](https://x.com/vllm_project/status/1942049771361038584) 2025-07-07T02:35Z 25.7K followers, [----] engagements

"@winglian Thank you for the investigation. We will bump the version immediately so you can use vLLM's nightly wheel with PyTorch 2.7.1, and it will be the default for the next release" [X Link](https://x.com/vllm_project/status/1945199079346036955) 2025-07-15T19:09Z 25.7K followers, [----] engagements

"Step [--] support just landed in vLLM. Both the 321B text-only and vision-language models are now fully supported. Step [--] is a blazing-fast, cost-effective VLM with MFA & AFD for efficient inference, even on modest GPUs.
Up to [----] tok/sec/GPU. We will optimize it further in the future. Deployment guide: https://huggingface.co/stepfun-ai/step3/blob/main/docs/deploy_guidance.md #vLLM #Step3 #OpenSourceAI #VLM Announcing Step 3: Our latest open-source multimodal reasoning model is here. Get ready for a stronger, faster & more cost-effective VLM. 321B parameters (38B active) optimized" [X Link](https://x.com/vllm_project/status/1950954138541711802) 2025-07-31T16:18Z 25.6K followers, [----] engagements

"With v0.7.0 you can apply multiple compressors to a single model. This enables mixing formats like NVFP4 and FP8, giving you fine-grained control over quantization strategies and better handling of sensitive layers. Block FP8 (DeepSeekV3-style) is also supported" [X Link](https://x.com/vllm_project/status/1960432744854642986) 2025-08-26T20:02Z 28.3K followers, [----] engagements

"MoE calibration is now more flexible with support for NVFP4 quantized experts. Llama4 models can now be quantized and run with vLLM using GPTQ or NVFP4. Includes support for WN16. Full release notes: https://github.com/vllm-project/llm-compressor/releases/tag/0.7.0" [X Link](https://x.com/vllm_project/status/1960432746054214018) 2025-08-26T20:02Z 28.2K followers, [----] engagements

"Step [--]: Query & Key Projection. Hidden states are projected into query/key space (with rotary embeddings). A new addition: per-head weights are also projected from hidden states; these will reweight logits later" [X Link](https://x.com/vllm_project/status/1972617276097798273) 2025-09-29T10:59Z 26.6K followers, [----] engagements

"@sundeep @IBMwatsonx @GroqInc @IBM @ArvindKrishna @robdthomas congrats! Is this a hardware plugin that will be open-sourced?" [X Link](https://x.com/vllm_project/status/1980299849175351416) 2025-10-20T15:47Z 23.2K followers, [----] engagements

"Self-improving FlashInfer kernels. We look forward to the integration and to making it shine on real-world workloads.
Can AI optimize the systems it runs on? Introducing FlashInfer-Bench, a workflow that makes AI systems self-improving with agents: - Standardized signature for LLM serving kernels - Implement kernels with your preferred language - Benchmark them against real-world serving https://t.co/FU9JjJ1vmf" [X Link](https://x.com/vllm_project/status/1980721990538494297) 2025-10-21T19:44Z 25.3K followers, [----] engagements

"@bdambrosio yes, it's supported; you can pass in token ids directly in the v1/completions endpoint" [X Link](https://x.com/vllm_project/status/1981043614492086574) 2025-10-22T17:02Z 23.9K followers, [----] engagements

"The probability sampled from the model should be exactly the same whether you're running a single request, batching multiple requests together, or prefilling with generated tokens" [X Link](https://x.com/vllm_project/status/1981088864963101144) 2025-10-22T20:02Z 24.1K followers, [----] engagements

"Implementation took [--] main approaches: [--] building custom operator implementations, [--] overriding and monkey-patching execution throughout the stack, [--] carefully tweaking some existing backends. Each piece was crucial to achieving batch-level bitwise invariance" [X Link](https://x.com/vllm_project/status/1981088866133299230) 2025-10-22T20:02Z 24.1K followers, [---] engagements

"Testing Strategy: Split tests into prefill vs. prefill+decode (vLLM has different code paths). Ran a mix of big/small requests at bs=1 and bs=N, checking logprobs are exactly the same. No tolerance for approximation" [X Link](https://x.com/vllm_project/status/1981088873355890692) 2025-10-22T20:02Z 23.8K followers, [---] engagements

"Debugging Benefits: With proper batch lane tracking we could check for exact bitwise equivalence to isolate non-invariant execution. Best part: batch-invariance makes debugging SO much easier and more rigorous going forward" [X Link](https://x.com/vllm_project/status/1981088874731606424) 2025-10-22T20:02Z 24K followers, [----] engagements

"DeepSeek-OCR is now officially supported in upstream vLLM, with the custom logits processor as a core component of the model release. The integration has been validated by the original author, and here's a dedicated recipe to get started. Check it out https://t.co/xEJ5DDz8zl" [X Link](https://x.com/vllm_project/status/1981167769883463704) 2025-10-23T01:16Z 26K followers, [----] engagements

"vLLM's sleep mode ( see ) can be leveraged to implement a similar system to Aegaeon to cut down the GPU cost of a model market like @huggingface. In fact, Aegaeon is also based on vLLM https://x.com/vllm_project/status/1983069225460650103 What DeepSeek did to AI software, Alibaba Cloud is doing to AI hardware. By enabling a GPU to serve multiple LLMs at the same time, Alibaba is able to cut the number of (Nvidia) GPUs by a whopping 82%. It's like Japanese cars v. American gas-guzzlers in the 1970s.
Aegaeon is https://t.co/1Y9zlY8ps5 https://x.com/vllm_project/status/1983069225460650103 What" [X Link](https://x.com/vllm_project/status/1983077659547709870) 2025-10-28T07:45Z 24.5K followers, [----] engagements

"Our first official vLLM Meetup is coming to Europe on Nov [--] Meet vLLM committers @mgoin_ @tms_jr Thomas Parnell + speakers from @RedHat_AI @IBM @MistralAI. Topics: vLLM updates, quantization, Mistral+vLLM, hybrid models, distributed inference https://luma.com/0gls27kb" [X Link](https://x.com/vllm_project/status/1983291257549140414) 2025-10-28T21:54Z 25.2K followers, [----] engagements

"Congrats to @Zai_org on the release of their latest visual-text compression framework. Don't miss the amazing official end-to-end demo - powered by vLLM's efficient model deployment for lightning-fast performance. Want to deploy your own? Check out our model usage guide for more tips https://docs.vllm.ai/projects/recipes/en/latest/GLM/Glyph.html Glyph: Scaling Context Windows via Visual-Text Compression Paper: https://t.co/oxDsNJZXRz Weights: https://t.co/IYn2pAQjAn Repo: https://t.co/lFW7ajnHCk Glyph is a framework for scaling the context length through visual-text compression. It renders" [X Link](https://x.com/vllm_project/status/1983346316236542266) 2025-10-29T01:33Z 24.5K followers, [----] engagements

"Following our big announcement, here's the full vLLM takeover at Ray Summit [----], San Francisco, Nov [--], hosted by @anyscalecompute. Get ready for deep dives into high-performance inference, unified backends, prefix caching, MoE serving, and large-scale orchestration. Save this schedule. November [--] vLLM State + Scaling Track: Simon Mo, State of vLLM [----]; Kaustubh Rao, FlashInfer: Accelerating LLM Inference Through Unified High-Performance Kernels; Deepak Chandramouli, Ankur Goenka, Rehan Durrani: vLLM in Apple; Scaling LLM Inference with RayServe & vLLM: Building a Serverless Internal" [X Link](https://x.com/vllm_project/status/1983520933701996790) 2025-10-29T13:06Z 25.5K followers, [----] engagements

"Excited to team up with @NVIDIAAIDev to bring Nemotron Nano [--] VL to vLLM - a multimodal model powered by a hybrid Transformer-Mamba language backbone, built for video understanding and document intelligence. Full post https://blog.vllm.ai/2025/10/31/run-multimodal-reasoning-agents-nvidia-nemotron.html" [X Link](https://x.com/vllm_project/status/1984334926972592193) 2025-10-31T19:01Z 26K followers, [----] engagements

"Love the retrospective on disaggregated inference. If you wonder where the technique named "PD" in vLLM comes from, read on. Thank you @haoailab for pushing the idea forward. New Blog: Disaggregated Inference: [--] Months Later. [--] months in LLM inference feels like a new Moore's Law cycle, but this time not just 2x per year: Serving cost 10-100x, Throughput 10x, Latency 5x. A big reason? Disaggregated Inference. From DistServe our" [X Link](https://x.com/vllm_project/status/1985761953432944893) 2025-11-04T17:31Z 25.7K followers, [----] engagements

"Learn how to deploy vLLM on NVIDIA DGX Spark the right way. NVIDIA just published a detailed best-practices guide for running high-throughput inference with vLLM, including multi-node setups and optimized Docker builds. @NVIDIAAIDev Dive in: #vLLM #NVIDIA #DGXSpark #LLMInference #AIInfrastructure https://build.nvidia.com/spark/vllm/overview" [X Link](https://x.com/vllm_project/status/1986049283339243821) 2025-11-05T12:33Z 25.7K followers, [----] engagements
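For readers new to container-based deployments, the general shape of running vLLM under Docker looks roughly like the sketch below. It uses the upstream vllm/vllm-openai image and a placeholder model; it is not the DGX Spark-specific configuration from NVIDIA's guide.

```bash
# Generic containerized serving sketch (not the DGX Spark-tuned setup from the guide above).
docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3-8B   # placeholder model name

# The container exposes an OpenAI-compatible API on port 8000.
curl http://localhost:8000/v1/models
```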
"Looking forward to integrating @perplexity_ai's fast communication kernels in vLLM. Faster than DeepEP for decode on ConnectX-7. First viable kernel on EFA. SM-free RDMA transfer. Supports prefill. (Maybe portable to other hardware as well)" [X Link](https://x.com/vllm_project/status/1986119917297672245) 2025-11-05T17:14Z 25.3K followers, [----] engagements

"Join us Wednesday November [--] in Palo Alto to deep dive into vLLM and @AMD. Hear from @Meta as well on how they are using AMD GPUs with vLLM. vLLM + Meta + AMD Meetup: https://t.co/txkndkuLF6" [X Link](https://x.com/vllm_project/status/1986496366353522899) 2025-11-06T18:10Z 25.5K followers, [----] engagements

"vLLM's plan for omni-modality. This is a great achievement from the SGLang team. I got immediately asked by a few people what vLLM's plan is. Multimodal generation with omni models is something we have been working on with the community and model vendors. As a sneak peek, Hunyuan-image [---] (that beats" [X Link](https://x.com/vllm_project/status/1986930498346885614) 2025-11-07T22:55Z 25.7K followers, [----] engagements

"Thanks to @github for spotlighting vLLM in the Octoverse [----] report as one of the fastest-growing open-source AI projects this year. Top OSS by contributors. Fastest-growing by contributors. Attracting the most first-time contributors. Trusted by leading open model communities and industry partners including NVIDIA, Meta, Red Hat, DeepSeek, Qwen, Moonshot and others, vLLM has become a preferred engine for efficient LLM inference. With almost 63K stars and [----] contributors, this growth belongs to the community. Together we're building easier, faster, and cheaper LLM serving for everyone. #vLLM" [X Link](https://x.com/vllm_project/status/1988619715024130203) 2025-11-12T14:47Z 26.1K followers, 10.2K engagements

"vLLM Party @ NeurIPS [----] San Diego. We're teaming up with @RedHat_AI @AMD and @IBM to host an open-source community party at NeurIPS. No agenda. No slides. Just: tacos, drinks, arcade games, unlimited tokens, and conversations with builders across the AI infra ecosystem. Dec [--], 18:00-21:00 GMT-8, Coin-Op Game Room. Register: Come for the tacos, stay for the pinball. #vLLM #NeurIPS2025 #AIInfra #OpenSource https://luma.com/alpuytvr" [X Link](https://x.com/vllm_project/status/1991128582874394981) 2025-11-19T12:57Z 26.2K followers, [----] engagements

"Docker Model Runner now integrates vLLM. High-throughput LLM inference is now available with the same Docker workflow devs already use.
Native safetensors support. Automatic routing between llama.cpp (GGUF) and vLLM (safetensors). Works from laptops to clusters with one CLI. Bringing easy, fast, and affordable LLM serving to the @Docker ecosystem. Check out the blog: https://blog.vllm.ai/2025/11/19/docker-model-runner-vllm.html You've got a model running locally. Now make it repeatable, shareable, and scalable. Join our live webinar on Docker Model Runner to run, manage, and scale models with ease, plus" [X Link](https://x.com/vllm_project/status/1991534839464620511) 2025-11-20T15:51Z 26.2K followers, 53.2K engagements

"Need to customize vLLM? Don't fork it. vLLM's plugin system lets you inject surgical modifications without maintaining a fork or monkey-patching entire modules. Blog by Dhruvil Bhatt from AWS SageMaker. Why plugins over forks: vLLM releases every [--] weeks with 100s of PRs merged; forks require constant rebasing & conflict resolution; monkey patches break on every vLLM upgrade. How it works: Use VLLMPatchTargetClass for precise class-level mods. Register via vllm.general_plugins entry point. Control patches with env vars (VLLM_CUSTOM_PATCHES). Version-guard with min_vllm_version decorator. Example: Add" [X Link](https://x.com/vllm_project/status/1991886835724013787) 2025-11-21T15:10Z 26.2K followers, 21.6K engagements

"Congrats on the launch! Speculators + vLLM brings a clean, standardized path for speculative decoding, making it easier to move from draft models to real production workloads. Excited to see how the community uses this to build faster, more efficient inference systems. We're open-sourcing a set of high quality speculator models for Llamas, Qwens, and gpt-oss on Hugging Face. In real workloads you can expect [---] to 2.5x speedups and sometimes more than 4x. Here's how this fits into the bigger story for speculative decoding. A thread: https://t.co/YCmbBqzqf6" [X Link](https://x.com/vllm_project/status/1992446929897394245) 2025-11-23T04:15Z 26.5K followers, 20.1K engagements

"vLLM Talent Pool is Open. As LLM adoption accelerates, vLLM has become the mainstream inference engine used across major cloud providers (AWS, Google Cloud, Azure, Alibaba Cloud, ByteDance, Tencent, Baidu) and leading model labs (DeepSeek, Moonshot, Qwen). To meet the strong demand from top companies, the vLLM community is now collecting resumes year-round and helping with referrals (internships & full-time). If you have experience in any of the following areas we'd love to hear from you: RL frameworks & algorithms for LLMs, Tool calling, MCP, Harmony format, OpenAI/Anthropic API, Structured output /" [X Link](https://x.com/vllm_project/status/1992979748067357179) 2025-11-24T15:32Z 30.3K followers, 73.7K engagements

"Great to see more compact OCR models landing in open source lately, and HunyuanOCR is a standout: a strong, versatile 1B model with impressive real-world coverage. The vLLM community already provides Day-0 support so you can try it right away: https://docs.vllm.ai/projects/recipes/en/latest/Tencent-Hunyuan/HunyuanOCR.html We are thrilled to open-source HunyuanOCR, an expert end-to-end OCR model built on Hunyuan's native multimodal architecture and training strategy. This model achieves SOTA performance with only [--] billion parameters, significantly reducing deployment costs.
Benchmark Leader:" [X Link](https://x.com/vllm_project/status/1993291230558716237) 2025-11-25T12:10Z 26.5K followers, 13.8K engagements

"FP8 RL on consumer GPUs just got a boost. Thrilled to team up with @UnslothAI and TorchAO to bring FP8 GRPO to vLLM: [---] faster RL inference, 60% less VRAM, [--] longer context, and Qwen3-1.7B fitting in 5GB VRAM. You can now run FP8 reinforcement learning on consumer GPUs. Try DeepSeek-R1's FP8 GRPO at home using only a 5GB GPU. Qwen3-1.7B fits in 5GB VRAM. We collabed with PyTorch to make FP8 RL inference [---] faster. Unsloth: 60% less VRAM, [--] longer context. https://t.co/YiBAUb8hz5 https://t.co/X4J6VmRMjY" [X Link](https://x.com/vllm_project/status/1993363216358162579) 2025-11-25T16:56Z 26.6K followers, 22.1K engagements

"Love this: a community contributor built vLLM Playground to make inferencing visible, interactive, and experiment-friendly. From visual config toggles to automatic command generation, from GPU/M-chip support to GuideLLM benchmarking + LLMCompressor integration, it brings the whole vLLM lifecycle into one unified UX. Huge kudos to micyang for this thoughtful, polished contribution. https://github.com/micytao/vllm-playground" [X Link](https://x.com/vllm_project/status/1995026681153876314) 2025-11-30T07:06Z 26.6K followers, 24.7K engagements

"More inference workloads now mix autoregressive and diffusion models in a single pipeline to process and generate multiple modalities - text, image, audio, and video. Today we're releasing vLLM-Omni: an open-source framework that extends vLLM's easy, fast, and cost-efficient serving to omni-modality models like Qwen-Omni and Qwen-Image, with disaggregated stages for different model modules and components. If you know how to use vLLM, you already know how to use vLLM-Omni. Blogpost: Code: Docs: Examples: https://github.com/vllm-project/vllm-omni/tree/main/examples" [X Link](https://x.com/vllm_project/status/1995566791234629989) 2025-12-01T18:52Z 26.8K followers, 26.4K engagements

"Currently it only supports Qwen-Omni and Qwen-Image, and this is just the beginning. More models are coming. https://huggingface.co/Qwen/Qwen-Image https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct" [X Link](https://x.com/vllm_project/status/1995696085977973004) 2025-12-02T03:26Z 29.7K followers, 11.2K engagements

"Congratulations to the Mistral team on launching the Mistral [--] family. We're proud to share that @MistralAI @NVIDIAAIDev @RedHat_AI and vLLM worked closely together to deliver full Day-0 support for the entire Mistral [--] lineup.
This collaboration enabled: NVFP4 (llm-compressor) optimized checkpoints; sparse MoE kernels for Mistral Large [--]; prefill/decode disaggregated serving; multimodal + long-context inference; efficient inference on A100 / H100 / Blackwell. A huge thank-you to @MistralAI @NVIDIAAIDev and @RedHat_AI for the strong partnership and engineering effort that made Day-0" [X Link](https://x.com/vllm_project/status/1995890057224618154) 2025-12-02T16:17Z 26.8K followers, 31.7K engagements

"LLM agents are powerful but can be slow at scale. @Snowflake's model-free SuffixDecoding from Arctic Inference now runs natively in vLLM, beating tuned N-gram speculation across concurrency levels while keeping CPU and memory overhead in check. Quick start in vLLM: https://docs.vllm.ai/en/stable/features/spec_decode/#speculating-using-suffix-decoding Suffix Decoding is at #NeurIPS2025 as a spotlight. It accelerates LLM inference for coding agents and RL. We also optimized its speculation speed by 7.4x and merged it into vLLM (incoming to SGLang). Talk to @GabrieleOliaro or me at poster #816" [X Link](https://x.com/vllm_project/status/1996130115856859461) 2025-12-03T08:11Z 26.5K followers, 11.2K engagements

"Proud to share the first production-ready vLLM plugin for Gaudi, developed in close collaboration with the Intel team and fully aligned with upstream vLLM. This release is validated and ready for deployment, with support for the latest vLLM version coming soon. The @intel Gaudi team also completely revamped the plugin documentation to make onboarding even smoother. Release: https://github.com/vllm-project/vllm-gaudi/releases/tag/v0.11.2 Docs: https://vllm-gaudi.readthedocs.io/ #vLLM #Gaudi #Intel #OpenSource #AIInfra" [X Link](https://x.com/vllm_project/status/1996207672245518782) 2025-12-03T13:19Z 26.7K followers, [----] engagements

"vLLM now offers an optimized inference recipe for DeepSeek-V3.2. Startup details: run vLLM with DeepSeek-specific components: --tokenizer-mode deepseek_v32 --tool-call-parser deepseek_v32. Usage tips: enable thinking mode in vLLM with extra_body={"chat_template_kwargs": {"thinking": True}} and use reasoning instead of reasoning_content. Special thanks to @TencentCloud for compute and engineering support. Full recipe (including how to properly use the thinking with tool calls feature): #vLLM #DeepSeek #Inference #ToolCalling #OpenSource" [X Link](https://x.com/vllm_project/status/1996760535908642986) 2025-12-05T01:56Z 26.9K followers, 31.9K engagements

"vLLM v0.12.0 is now available. For inference teams running vLLM at the center of their stack, this release refreshes the engine, extends long-context and speculative decoding capabilities, and moves us to a PyTorch 2.9.0 / CUDA [----] baseline for future work" [X Link](https://x.com/vllm_project/status/1996947370588946861) 2025-12-05T14:18Z 26.8K followers, [----] engagements

"Congrats to the @Zai_org team on the launch of GLM-4.6V and GLM-4.6V-Flash, with day-0 serving support in vLLM and recipes for teams who want to run them on their own GPUs. GLM-4.6V focuses on high-quality multimodal reasoning with long context and native tool/function calling, while GLM-4.6V-Flash is a 9B variant tuned for lower latency and smaller-footprint deployments; our new vLLM Recipe ships ready-to-run configs, multi-GPU guidance, and production-minded defaults. If you're building inference services and want GLM-4.6V in your stack, start here:" [X Link](https://x.com/vllm_project/status/1998019338033680574) 2025-12-08T13:18Z 27.3K followers, 45.6K engagements
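The DeepSeek-V3.2 recipe post above enables thinking mode through chat_template_kwargs, and the GLM recipes use the same OpenAI-compatible serving path. A minimal sketch of such a request, assuming a locally running vLLM server; the model name and port are illustrative placeholders, not values from either recipe:

```bash
# Sketch of a chat completion with thinking mode enabled via chat_template_kwargs,
# following the usage tip quoted in the DeepSeek-V3.2 recipe post above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.2",
        "messages": [{"role": "user", "content": "Summarize the trade-offs of sparse attention."}],
        "chat_template_kwargs": {"thinking": true}
      }'
```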
"Congrats to the @MistralAI team on the launch of Devstral [--]. vLLM now delivers Day-0 support for the Devstral [--] Instruct models, optimized for agentic coding, deep codebase exploration, and multi-file editing at scale. Feel free to reach out. Introducing the Devstral [--] coding model family. Two sizes, both open source. Also meet Mistral Vibe, a native CLI enabling end-to-end automation." [X Link](https://x.com/vllm_project/status/1998428798891765926) 2025-12-09T16:25Z 26.9K followers, 19.3K engagements

"Low-bit LLM quantization doesn't have to mean painful accuracy trade-offs or massive tuning runs. Intel's AutoRound PTQ algorithm is now integrated into LLM Compressor, producing W4A16 compressed-tensor checkpoints you can serve directly with vLLM across Intel Xeon, Gaudi, Arc GPUs, and more. Huge thanks to the @intel Neural Compressor, @RedHat_AI optimization, LLM Compressor, and vLLM communities for making this integration happen. Repo: https://github.com/vllm-project/llm-compressor Blog: https://blog.vllm.ai/2025/12/09/intel-autoround-llmc.html" [X Link](https://x.com/vllm_project/status/1998710451312771532) 2025-12-10T11:04Z 26.9K followers, 12.2K engagements

"vLLM was mentioned in about half of the PyTorch Conference [----] talks (53/117). Several months ago, when the @PyTorch conference agenda was out, we noticed that there would be [--] dedicated talks about vLLM. After the PyTorch conference we find that actually about half of the talks mentioned vLLM. Check out for more details. These mentions can be roughly categorized as: - vLLM itself as a hosted project in the PyTorch foundation, including a vLLM general introduction, vLLM V1 on AMD GPUs, vLLM-triton-backend - Usage in the open-source ecosystem, including Ray, PyTorch Core, PyTorch Symmetric Memory, PyTorch and" [X Link](https://x.com/vllm_project/status/2000108248339239161) 2025-12-14T07:39Z 27K followers, 18K engagements

"Multimodal serving pain: vision encoder work can stall text prefill/decode and make tail latency jittery. We built Encoder Disaggregation (EPD) in vLLM: run the encoder as a separate, scalable service, pipeline it with prefill/decode, and reuse image embeddings via caching. This provides an efficient and flexible pattern for multimodal serving. Results: consistently higher throughput (5-20% across stable regions) and significant reductions in P99 TTFT and P99 TPOT. Read more: #vLLM #LLMInference #Multimodal https://blog.vllm.ai/2025/12/15/vllm-epd.html" [X Link](https://x.com/vllm_project/status/2000535421642502335) 2025-12-15T11:56Z 27.4K followers, 10.9K engagements

"Amazing work from @Winterice10 and the team on fast video generation. We're excited about the upcoming collaboration to integrate TurboDiffusion into vLLM-Omni. Check it out. TurboDiffusion: [------] faster video generation on a single RTX [----]. Only takes 1.8s to generate a high-quality 5-second video.
The key to both high speed and high quality: SageAttention + Sparse-Linear Attention (SLA) + rCM. Github: https://t.co/ybbNBjgHFP Technical: https://t.co/6d6foxEQ9Z" [X Link](https://x.com/vllm_project/status/2000720345872130413) 2025-12-16T00:11Z 27.5K followers, 10.1K engagements

"vLLM delivers even more inference performance with the same GPU platform. In just [--] month we've worked with NVIDIA to increase @nvidia Blackwell maximum throughput per GPU by up to 33% -- significantly reducing cost per token -- while also enabling even higher peak speed for the most latency-sensitive use cases, powered by deep PyTorch integration and collaboration" [X Link](https://x.com/vllm_project/status/2001449658984632699) 2025-12-18T00:29Z 28.4K followers, 99.2K engagements

"Scaling MoE inference is often communication + KV-cache bound: once you push expert parallelism, decode can become dominated by collectives and imbalance, and prefill stragglers can stall an entire EP group. New community benchmark results for vLLM wide-EP on multi-node H200 (Coreweave Infiniband + ConnectX-7): - Sustained 2.2k tokens/s per H200 GPU (up from earlier 1.5k tokens/s per GPU). In the post we share the key pieces that enable this: - Wide-EP (--enable-expert-parallel) for DeepSeek-style MoE + MLA KV efficiency - DeepEP all-to-all, Dual-batch Overlap (DBO), and Expert Parallel Load" [X Link](https://x.com/vllm_project/status/2001695354983723361) 2025-12-18T16:45Z 28.4K followers, 30.1K engagements

"Diffusion serving is expensive: dozens of timesteps per image and a lot of redundant compute between adjacent steps. vLLM-Omni now supports diffusion cache acceleration backends (TeaCache + Cache-DiT) to reuse intermediate Transformer computations - no retraining, minimal quality impact. Benchmarks (NVIDIA H200, Qwen-Image 1024x1024): TeaCache 1.91x, Cache-DiT 1.85x. For Qwen-Image-Edit, Cache-DiT hits 2.38x. Blog: Docs: https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion_acceleration #vLLM #vLLMOmni #DiffusionModels #AIInference" [X Link](https://x.com/vllm_project/status/2002233999947890810) 2025-12-20T04:25Z 27.8K followers, 20.2K engagements

"Thanks to the community's contributions and our collaboration with the Qwen-Image team, Qwen-Image-Layered is now supported in vLLM-Omni. Check it out https://github.com/vllm-project/vllm-omni/pull/381 Qwen-Image-Layered is LIVE: native image decomposition, fully open-sourced. Why it stands out: Photoshop-grade layering - physically isolated RGBA layers with true native editability. Prompt-controlled structure - explicitly specify [---] layers from coarse layouts to https://t.co/g5mvTt0KTT" [X Link](https://x.com/vllm_project/status/2002444173153349957) 2025-12-20T18:21Z 27.9K followers, 10.6K engagements

"vLLM v0.13.0 is now available. Huge thanks to the community. Highlights: Engine core: compile_ranges for selective kernel compilation, PrefixLM support for FlexAttention + TritonAttention, and CUDA graphs for 3D Triton attention.
Plus: xxHash option for prefix caching, chunked prefill for ALL pooling tasks, and Model Runner V2 updates (min-p sampling, logits NaN detection)" [X Link](https://x.com/vllm_project/status/2002984872814457060) 2025-12-22T06:09Z 27.8K followers, 24.7K engagements

"Congrats to the GLM team on GLM-4.7, a step up in the GLM-4.x series, with day-0 serving support in vLLM. Supports MTP decode (faster throughput). Tool/function calling. Thinking controls: interleaved/preserved/per-turn. Command in screenshot below. Read more: https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM.html GLM-4.7 is here. GLM-4.7 surpasses GLM-4.6 with substantial improvements in coding, complex reasoning, and tool usage, setting new open-source SOTA standards. It also boosts performance in chat, creative writing, and role-play scenarios. Default Model for Coding Plan:" [X Link](https://x.com/vllm_project/status/2003269455942651925) 2025-12-23T01:00Z 27.6K followers, 17.8K engagements

"The vLLM community has added support for LongCat-Image-Edit (from the @Meituan_LongCat team) in vLLM-Omni. - Simpler path to serve instruction-following image edits - Supports common operations like object add/replace, background changes, and style adjustments - Useful for retouching tools and creative editing pipelines. Recipe: https://docs.vllm.ai/projects/recipes/en/latest/Meituan/Longcat.html https://twitter.com/i/web/status/2003313952516714683" [X Link](https://x.com/vllm_project/status/2003313952516714683) 2025-12-23T03:57Z 29.8K followers, [----] engagements

"Huge congrats to @MiniMaxAI_ on shipping M2.1, a massive step forward for open-source agents. vLLM provides Day-0 support for this release. We are excited to empower the community to run this full-stack development powerhouse with maximum efficiency. Model Repo: https://huggingface.co/MiniMaxAI/MiniMax-M2.1 Deploy Recipe: https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html#launching-minimax-m21m2-with-vllm Commands below. #vLLM #MiniMax #OpenSource #AI #LLM" [X Link](https://x.com/vllm_project/status/2004480564020253074) 2025-12-26T09:13Z 28.2K followers, 22.2K engagements

"Big news: the official vLLM website is LIVE. We've built a dedicated hub for our growing community, separating logistics from code so the GitHub repo can focus purely on development. Highlights: Interactive vLLM Install Selector (GPU, CPU, etc.), Community Events Calendar (office hours, meetups, etc.), Centralized Resources (docs, recipes, etc.). Check it out https://vllm.ai" [X Link](https://x.com/vllm_project/status/2005461211656155153) 2025-12-29T02:09Z 28.4K followers, 54.6K engagements

"Thanks for diving deep into vLLM and sharing your findings. We're working to make more beginner-friendly documentation available.
In the meantime, we recommend using the Search (AI) button on our website to find what you need and also following the vLLM office hours https://www.youtube.com/playlist?list=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3 http://vllm.ai we dive deep into vllm-project/vllm https://t.co/dOjzontlCj" [X Link](https://x.com/vllm_project/status/2005640089133830371) 2025-12-29T14:00Z 28.4K followers, 36.8K engagements

"What a way to end [----]: vllm-project/vllm just hit [----] contributors on GitHub. This milestone belongs to all of you. From the first-time contributor fixing a doc typo to the systems engineer rewriting kernels, you are the reason vLLM evolves so fast. Thank you for every PR, every issue, and every debate. We built this engine together. With this speed comes complexity. To help you track your code we added a little utility." [X Link](https://x.com/vllm_project/status/2006208185837724070) 2025-12-31T03:37Z 28.8K followers, [----] engagements

"Congrats to @Alibaba_Qwen on the release of Qwen-Image-2512. We are thrilled to announce Day-0 support in vLLM-Omni. You can now serve this SOTA open-source image model with our optimized pipelined architecture immediately. Read more: See it running below: https://github.com/vllm-project/vllm-omni/pull/547 A New Year gift from Qwen: Qwen-Image-2512 is here. Our December upgrade to Qwen-Image, just in time for the New Year. What's new: More realistic humans, dramatically reduced AI look, richer facial details. Finer natural textures, sharper landscapes, water" [X Link](https://x.com/vllm_project/status/2006333700988936360) 2025-12-31T11:56Z 29.9K followers, 34K engagements

"Spot on @OracleDevs: combining vLLM with NVIDIA MIG unlocks peak GPU performance and turns hardware into real AI value. With NVIDIA MIG for fine-grained GPU partitioning, OKE for scheduling and autoscaling, vLLM for inference-optimized serving, and DGCM observability for monitoring MIG-level metrics, customers can unlock every ounce of performance from their A100, H100, etc. investments and turn https://t.co/kshxcuceNt" [X Link](https://x.com/vllm_project/status/2007654309044072843) 2026-01-04T03:24Z 28.3K followers, [----] engagements

"Celebrating vLLM Semantic Router v0.1 Iris, a huge milestone with 600+ PRs from 50+ contributors. This release brings system-level intelligence to Mixture-of-Models (MoM). Highlights: Signal-Decision Plugin Chain, HaluGate (hallucination detection), Modular LoRA & Helm charts. Read the release: https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html #vLLM #SemanticRouter #OpenSource #AI" [X Link](https://x.com/vllm_project/status/2008369377956295060) 2026-01-06T02:45Z 28.7K followers, 13.7K engagements

"Huge congratulations to the community on shipping vLLM-Omni v0.12.0rc1. This release shifts the focus from enabling multimodal to making it production-grade - faster, stable, and standard-compliant. Highlights from the [---] commits: Diffusion Overhaul: Integrated TeaCache, Cache-DiT, Sage Attention, Ulysses Sequence Parallelism & Ring Attention for faster generation. Standard Serving: Native OpenAI-compatible endpoints for Image & Speech.
New Models: Support for Wan2.2 (video), Qwen-Image-2512 & SD3. AMD Ready: Official ROCm Docker & CI support. A massive thank you to the [--] contributors" [X Link](https://x.com/vllm_project/status/2008482657991368738) 2026-01-06T10:15Z 28.7K followers, [----] engagements

"16k TPS with vLLM on B200. Thanks for sharing this success; it's inspiring our community to push boundaries. 16k tokens per second, i have NEVER seen this many tokens in my life: nvidia B200 from prime, trinity mini from arcee (26b moe), served by vllm (0.13) with [--] tensor parallelism, medical SYNTH dataset generation pipeline, [---] req/s, 16k tps DAMN https://t.co/Ov8TWhmvOZ" [X Link](https://x.com/vllm_project/status/2009196819331600648) 2026-01-08T09:33Z 28.4K followers, 12.6K engagements

"Max out your inference throughput with vLLM's new KV Offloading Connector. This feature from IBM Research allows asynchronous offloading of KV cache to CPU RAM, effectively handling request preemptions and boosting concurrency. Up to 9x increase in throughput on H100; 2x-22x reduction in TTFT for cache hits https://twitter.com/i/web/status/2009217642507477222" [X Link](https://x.com/vllm_project/status/2009217642507477222) 2026-01-08T10:56Z 28.7K followers, 13.5K engagements

"To make this efficient, the team optimized the Host-Device transfer pipeline. By restructuring GPU memory to use contiguous physical blocks (KB to MB), this design unlocks high-speed DMA transfers that run asynchronously without stalling GPU computation" [X Link](https://x.com/vllm_project/status/2009217645946773534) 2026-01-08T10:56Z 28.4K followers, [----] engagements

"Want to try it? --kv_offloading_backend native --kv_offloading_size GB. Read the full deep dive on these optimizations for memory variance and DMA: #vLLM #AI #Inference https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html" [X Link](https://x.com/vllm_project/status/2009217648224247840) 2026-01-08T10:56Z 28.4K followers, [----] engagements

"vLLM-Omni now Day-0 supports GLM-Image by @Zai_org. GLM-Image combines autoregressive & diffusion capability for realistic, text-rich image generation. Highlights: Hybrid Architecture, Superior Text Rendering, Complex Logic Handling, Unified Generation/Editing. Try it now via PR #763. Performance optimizations for the AR component are coming soon. Stay tuned. https://github.com/vllm-project/vllm-omni/pull/763 Introducing GLM-Image: A new milestone in open-source image generation. GLM-Image uses a hybrid auto-regressive plus diffusion architecture combining strong global" [X Link](https://x.com/vllm_project/status/2011417983793717688) 2026-01-14T12:39Z 29.9K followers, [----] engagements

"Batch inference unlocks vLLM's full throughput potential. @charles_irl at @modal demonstrates how to fully saturate an H100 (100% GPU util) with Qwen [--] 8B: FlashInfer backend, async scheduling, optimized batch sizes. Result: maximum throughput at minimal cost. Read the guide: https://modal.com/docs/examples/vllm_throughput We also show how to hit 30k input tok/s & 2k output tok/s per H100 with the same model in @vllm_project for a latency-insensitive "batch" workload (summarizing SEC filings for insider trades). At current rates that's 5x cheaper than APIs. https://t.co/EF7DEvLPIM" [X Link](https://x.com/vllm_project/status/2011585247297880501) 2026-01-14T23:44Z 30.3K followers, 30.6K engagements
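The ingredients called out in that post map onto ordinary vLLM options. A minimal sketch, assuming a recent vLLM build where these flags are available; the model mirrors the post, and exact flag names and availability vary by version:

```bash
# Throughput-oriented serving sketch: FlashInfer attention + async scheduling.
# Flag availability depends on your vLLM version; check `vllm serve --help`.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve Qwen/Qwen3-8B \
    --async-scheduling \
    --max-num-seqs 512   # larger batches favor throughput over per-request latency
```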
Read the guide: https://modal.com/docs/examples/vllm_throughput We also show how to hit 30k input tok/s & 2k output tok/s per H100 with the same model in @vllm_project for a latency-insensitive "batch" workload (summarizing SEC filings for insider trades). At current rates that's 5x cheaper than APIs. https://t.co/EF7DEvLPIM" [X Link](https://x.com/vllm_project/status/2011585247297880501) 2026-01-14T23:44Z 30.3K followers, 30.6K engagements "7x Longer Context RL with @UnslothAI and @vllm_project You can now do reinforcement learning training with [--] longer context and no accuracy loss via our new batching algorithms. Long reasoning chains in RL are costly but now we enable you to train gpt-oss with GRPO & reach 380K context on a 192GB GPU. https://t.co/FFwxxq2v2N https://t.co/1h5O8rHZTL You can now do reinforcement learning training with [--] longer context and no accuracy loss via our new batching algorithms. Long reasoning chains in RL are costly but now we enable you to train gpt-oss with GRPO & reach 380K context on a 192GB GPU." [X Link](https://x.com/vllm_project/status/2011857612103630924) 2026-01-15T17:46Z 29.3K followers, 13.6K engagements "The first @vllm_project meetup of [----] in Munich is here Talks & demos from @RedHat @AMD @MistralAI and CROZ + hands-on GPU workshop. See you there ๐ฅจ๐ป Munich AI builders ๐ Join the @vllm_project meetup on [--] Feb for real world GenAI inference and optimization. Talks and demos from @RedHat @AMD @MistralAI and CROZ plus hands on GPU inference and time to connect with engineers building open AI. ๐ https://t.co/hZOi0Ei0bw https://t.co/EtvEzd1UYm Munich AI builders ๐ Join the @vllm_project meetup on [--] Feb for real world GenAI inference and optimization. Talks and demos from @RedHat @AMD" [X Link](https://x.com/vllm_project/status/2012103510008201358) 2026-01-16T10:03Z 28.9K followers, [----] engagements "๐ Day-0 support for FLUX.2 klein is now available in vLLM-Omni FLUX.2 klein from @bfl_ml brings high-performance image generation balancing speed with top-tier aesthetics: โก Sub-second Inference: 0.5s per image for real-time apps. ๐จ All-in-One: Integrated Text-to-Image & Inpainting. ๐ Consumer Friendly: Runs on consumer GPUs (13GB VRAM). ๐ Open Source: Apache [---] licensed 4B model. Try it out with the example script in PR #809: Model: #vLLM #vLLMOmni #FLUX2 #OpenSource https://huggingface.co/black-forest-labs/FLUX.2-klein-4B https://github.com/vllm-project/vllm-omni/pull/809 Introducing" [X Link](https://x.com/vllm_project/status/2012110024294965406) 2026-01-16T10:29Z 30.2K followers, [----] engagements "๐ Congrats @StepFun_ai vLLM now has Day-0 support for Step3-VL-10B a 10B multimodal model that punches way above its weight ๐ฅ 10B params SOTA performance ๐ง [--] reasoning modes: SeRe (Sequential) & PaCoRe (Parallel โก 10x-20x more efficient than larger models Download & Details: PR: https://github.com/vllm-project/vllm/pull/32329 https://huggingface.co/stepfun-ai/Step3-VL-10B ๐10B parameters 200B+ performance Introducing STEP3-VL-10B : our open-source SOTA vision language model. ๐ฅ At just 10B it redefines efficiency by matching or exceeding the capabilities of 100B/200B-scale models. SOTA" [X Link](https://x.com/vllm_project/status/2013764751348613235) 2026-01-21T00:05Z 29.9K followers, [----] engagements "๐ gRPC server entrypoint (#30190) Binary protocol + HTTP/2 multiplexing for high-throughput serving. 
๐ง --max-model-len auto (#29431) Automatically fits context length to available GPU memory - no more OOM at startup ๐ Model inspection view (#29450) See modules attention backends and quantization by setting VLLM_LOG_MODEL_INSPECTION=1 or printing the LLM object" [X Link](https://x.com/vllm_project/status/2013812835420086562) 2026-01-21T03:16Z 29.7K followers, [----] engagements "New Model Support: ๐ฆ Grok-2 with tiktoken tokenizer ๐ LFM2-VL vision-language model โก MiMo-V2-Flash ๐ GLM-ASR audio ๐งฉ K-EXAONE-236B-A23B MoE LoRA now supports multimodal tower/connector for LLaVA BLIP2 PaliGemma Pixtral and more ๐ฅ" [X Link](https://x.com/vllm_project/status/2013812837987111236) 2026-01-21T03:16Z 30K followers, [----] engagements "CUTLASS MoE optimizations: 2.9% throughput + 10.8% TTFT improvement via fill(0) optimization ๐ Hardware updates: ๐ป SM103 support โซ B300 Blackwell MoE configs ๐ข Marlin for Turing (sm75) Large-scale serving: XBO (Extended Dual-Batch Overlap) NIXL asymmetric TP ๐ ๐ https://github.com/vllm-project/vllm/releases/tag/v0.14.0 https://github.com/vllm-project/vllm/releases/tag/v0.14.0" [X Link](https://x.com/vllm_project/status/2013812840688189586) 2026-01-21T03:16Z 29.9K followers, [----] engagements "Love seeing this. ๐ฅ This is vLLM doing what it does best: serving as the bridge between powerful open source models like GLM-4.7-Flash and real-world inference. The workflow shown in vLLM Studio by @0xSero as a playground makes that connection tangible. Onward to easy fast and cheap inference for everyone ๐ Man what a model. I have not seen any mode below 200B act like this. It's really doing a good job in a pretty novel environment. The ZAI team is doing a great job I'm very happy they still open source all this. It's very impressive. https://t.co/BB6A0ezLGY https://t.co/kDj5KzWuFY Man" [X Link](https://x.com/vllm_project/status/2014536660361584833) 2026-01-23T03:12Z 30.3K followers, 14.9K engagements "vLLM 0.4.0 release has a suite of new additions To start we added support for @cohere's Command+R @DbrxMosaicAI's DBRX @Alibaba_Qwen's latest MoE. Notably all of these models are supported on DAY ONE directly from the model builders" [X Link](https://x.com/vllm_project/status/1776283242280390788) 2024-04-05T16:18Z 31.2K followers, [----] engagements "We just released v0.5.0 of vLLM one week away from our one year anniversary The new release marks many features entering beta phase ready for testing and usage. These features boost your LLM serving system for better performance and utilization. https://github.com/vllm-project/vllm/releases/tag/v0.5.0 https://github.com/vllm-project/vllm/releases/tag/v0.5.0" [X Link](https://x.com/vllm_project/status/1801417084058153454) 2024-06-14T00:51Z 31.2K followers, 11.7K engagements "vLLM running on the latest @nvidia H200 GPUs is delivering high token throughput per user on Llama [---] 405B out of the gate Check it out ๐" [X Link](https://x.com/vllm_project/status/1815805445770289336) 2024-07-23T17:45Z 31.2K followers, 12.8K engagements "๐ Thank you @nvidia for sponsoring vLLM development. The DGX H200 machine is marvelous We plan to use the machine for benchmarking and performance enhancement ๐" [X Link](https://x.com/vllm_project/status/1821644779727548808) 2024-08-08T20:28Z 31.2K followers, 42.1K engagements "A month ago we announced our performance roadmap. 
Today we are happy to share that the latest release achieves ๐2.7x higher throughput and is 5x faster for output latency on Llama 8B and 1.8x higher throughput and 2x faster on Llama 70B for H100s. https://blog.vllm.ai/2024/09/05/perf-update.html https://blog.vllm.ai/2024/09/05/perf-update.html" [X Link](https://x.com/vllm_project/status/1831742284804866237) 2024-09-05T17:12Z 31.2K followers, 95.9K engagements "We are excited to see @vllm_project as an option for local apps in the @huggingface hub It comes with easy snippets to quickly test out the model" [X Link](https://x.com/vllm_project/status/1833257997814096245) 2024-09-09T21:35Z 31.3K followers, 25.5K engagements "๐ผ pip install -U vLLM vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=4' --max_num_batched_tokens [-----] https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py magnet:xt=urn:btih:7278e625de2b1da598b23954c13933047126238a&dn=pixtral-12b-240910&tr=udp%3A%2F%https://t.co/OdtBUsbMKD%3A1337%2Fannounce&tr=udp%3A%2F%https://t.co/2UepcMHjvL%3A1337%2Fannounce&tr=http%3A%2F%https://t.co/NsTRgy7h8S%3A80%2Fannounce https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py" [X Link](https://x.com/vllm_project/status/1834049471451476426) 2024-09-12T02:00Z 31.2K followers, 36K engagements "pip install -U vLLM ๐ฆ vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs [--] We are excited to announce day [--] support for @AIatMeta's Llama [---] vision language models. Try it out using our latest released. Please see for more commands and known issues. https://github.com/vllm-project/vllm/issues/8826 ๐ฃ Introducing Llama 3.2: Lightweight models for edge devices vision models and more Whats new Llama [---] 1B & 3B models deliver state-of-the-art capabilities for their class for several on-device use cases with support for @Arm @MediaTek & @Qualcomm on day one." [X Link](https://x.com/vllm_project/status/1839075248823742714) 2024-09-25T22:51Z 31.2K followers, 20.4K engagements "Speculative decoding is one of the best tool in the vLLM's suite of inference optimization tool box accelerating the inference without accuracy loss. Checkout our blog post for more details about the state of spec decode in vLLM today ๐งต https://blog.vllm.ai/2024/10/17/spec-decode.html https://blog.vllm.ai/2024/10/17/spec-decode.html" [X Link](https://x.com/vllm_project/status/1848854044984725922) 2024-10-22T22:28Z 31.2K followers, 31K engagements "Open-source innovation is part of the vLLMs DNA and we love the PyTorch ecosystem Together let's push the boundaries of AI innovation and make it accessible to all๐ช vLLM Joins PyTorch Ecosystem ๐ @vllm_project has always had a strong connection with the PyTorch project. Tight coupling with PyTorch ensures seamless compatibility and performance optimization across diverse hardware platforms. Read more: https://t.co/GgTrTt2srd https://t.co/YYEQFVLlD6 vLLM Joins PyTorch Ecosystem ๐ @vllm_project has always had a strong connection with the PyTorch project. 
Tight coupling with PyTorch ensures" [X Link](https://x.com/vllm_project/status/1866228071818473512) 2024-12-09T21:06Z 31.2K followers, 24.1K engagements "pip install -U vLLM You can now run DeepSeek-V3 on latest vLLM many different ways: ๐ฐ Tensor parallelism on 8xH200 or MI300x or TP16 on IB connected nodes: --tensor-parallel-size ๐ Pipeline parallelism () across two 8xH100 or any collection of machines without high speed interconnect: --pipeline-parallel-size ๐ CPU offloading the model layers via --cpu-offload-gb See our distributed guide and enhancement plan We are at the start of optimizing this amazing model https://github.com/vllm-project/vllm/issues/11539 https://docs.vllm.ai/en/latest/serving/distributed_serving.html ๐ Introducing" [X Link](https://x.com/vllm_project/status/1872453508127130017) 2024-12-27T01:24Z 31.2K followers, 35.6K engagements "Thanks for the community support vLLM now supports MacOS natively๐คฉ Experimental support for Apple SoC (CPU only for now) has landed in @vllm_project. You'll need to compile from git HEAD. Link in ๐งต Experimental support for Apple SoC (CPU only for now) has landed in @vllm_project. You'll need to compile from git HEAD. Link in ๐งต" [X Link](https://x.com/vllm_project/status/1877542671113400811) 2025-01-10T02:27Z 31.3K followers, 17.5K engagements "๐ With the v0.7.0 release today we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup Clean code optimized execution loop zero-overhead prefix caching enhanced multimodal support and more" [X Link](https://x.com/vllm_project/status/1883966341557936514) 2025-01-27T19:52Z 31.2K followers, 96.1K engagements "We landed the 1st batch of enhancements to the @deepseek_ai models starting MLA and cutlass fp8 kernels. Compared to v0.7.0 we offer 3x the generation throughput 10x the memory capacity for tokens and horizontal context scalability with pipeline parallelism" [X Link](https://x.com/vllm_project/status/1885837174588989695) 2025-02-01T23:46Z 31.2K followers, 91.5K engagements "v0.7.2 is released Featuring ๐ผ @Alibaba_Qwen Qwen2.5-VL ๐ค @huggingface Transformers backend and several @deepseek_ai performance enhancements" [X Link](https://x.com/vllm_project/status/1887579288167456943) 2025-02-06T19:09Z 31.2K followers, 102.9K engagements "vLLM v0.7.3 now supports @deepseek_ai's Multi-Token Prediction module It delivers up to 69% speedup boost. You can turn it on with --num-speculative-tokens=1 and an optional --draft-tensor-parallel-size=1. We saw 81-82.3% acceptance rate on the ShareGPT" [X Link](https://x.com/vllm_project/status/1892646680719216960) 2025-02-20T18:45Z 31.3K followers, 40K engagements "We're excited to receive our first #NVIDIADGX B200 system which we'll use for vLLM research and development Thank you @nvidia" [X Link](https://x.com/vllm_project/status/1893001644037566610) 2025-02-21T18:15Z 31.2K followers, 119.5K engagements "๐ We just merged initial EP support in vLLM and will be integrating these collectives ASAP ๐ https://github.com/vllm-project/vllm/pull/12583 ๐ Day [--] of #OpenSourceWeek: DeepEP Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and inference. 
โ Efficient and optimized all-to-all communication โ Both intranode and internode support with NVLink and RDMA โ https://github.com/vllm-project/vllm/pull/12583 ๐ Day [--] of #OpenSourceWeek: DeepEP Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and" [X Link](https://x.com/vllm_project/status/1894215122966507801) 2025-02-25T02:37Z 31.2K followers, 40.2K engagements "๐ @vllm_project will be testing and integrating these GEMM kernels ASAP as well. ๐ Day [--] of #OpenSourceWeek: DeepGEMM Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs powering V3/R1 training and inference. โก Up to 1350+ FP8 TFLOPS on Hopper GPUs โ No heavy dependency as clean as a tutorial โ Fully Just-In-Time compiled ๐ Day [--] of #OpenSourceWeek: DeepGEMM Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs powering V3/R1 training and inference. โก Up to 1350+ FP8 TFLOPS on Hopper GPUs โ No heavy dependency as clean as a tutorial" [X Link](https://x.com/vllm_project/status/1894566889529245981) 2025-02-26T01:55Z 31.3K followers, 23K engagements "Just landed FlashMLA in vLLM and it is already boosting output throughput 2-16% We expect more improvements in the coming days as we continue to optimize the code path. https://github.com/vllm-project/vllm/pull/13747 ๐ Day [--] of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs optimized for variable-length sequences and now in production. โ BF16 support โ Paged KV cache (block size 64) โก [----] GB/s memory-bound & [---] TFLOPS https://github.com/vllm-project/vllm/pull/13747 ๐ Day [--] of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our" [X Link](https://x.com/vllm_project/status/1894994674630435123) 2025-02-27T06:15Z 31.4K followers, 49.5K engagements "Amazing system It is now the north star for LLM inference ๐. We will get there quickly. ๐ Day [--] of #OpenSourceWeek: One More Thing DeepSeek-V3/R1 Inference System Overview Optimized throughput and latency via: ๐ง Cross-node EP-powered batch scaling ๐ Computation-communication overlap โ Load balancing Statistics of DeepSeek's Online Service: โก 73.7k/14.8k ๐ Day [--] of #OpenSourceWeek: One More Thing DeepSeek-V3/R1 Inference System Overview Optimized throughput and latency via: ๐ง Cross-node EP-powered batch scaling ๐ Computation-communication overlap โ Load balancing Statistics of" [X Link](https://x.com/vllm_project/status/1895698627714257355) 2025-03-01T04:52Z 31.3K followers, 46.8K engagements "Spotted @vllm_project during Jensen's Keynote @nvidia #GTC" [X Link](https://x.com/vllm_project/status/1902065243343425949) 2025-03-18T18:31Z 31.3K followers, 38.6K engagements "vLLM v0.8.3 now supports @AIatMeta's latest Llama [--] Scout and Maverick. We see these open source models as a major step forward in efficiency with long context feature native multi-modality and MoE architecture. Best tips of running it ๐งต https://blog.vllm.ai/2025/04/05/llama4.html https://blog.vllm.ai/2025/04/05/llama4.html" [X Link](https://x.com/vllm_project/status/1908777599826092443) 2025-04-06T07:03Z 31.3K followers, 24.9K engagements "spotted @vllm_project at @googlecloud next keynote today" [X Link](https://x.com/vllm_project/status/1910191668437156154) 2025-04-10T04:42Z 31.2K followers, 51.6K engagements "๐ @deepseek_ai's highly performant inference engine is built on top of vLLM. 
Now they are open-sourcing the engine the right way: instead of a separate repo they are bringing changes to the open source community so everyone can immediately benefit https://github.com/deepseek-ai/open-infra-index/blob/main/OpenSourcing_DeepSeek_Inference_Engine/README.md https://github.com/deepseek-ai/open-infra-index/blob/main/OpenSourcing_DeepSeek_Inference_Engine/README.md" [X Link](https://x.com/vllm_project/status/1911669255428542913) 2025-04-14T06:34Z 31.2K followers, 206.6K engagements "vLLM๐ค๐ค You can now deploy any @huggingface language model with vLLM's speed. This integration makes it possible for one consistent implementation of the model in HF for both training and inference. ๐งต https://blog.vllm.ai/2025/04/11/transformers-backend.html https://blog.vllm.ai/2025/04/11/transformers-backend.html" [X Link](https://x.com/vllm_project/status/1912958639633277218) 2025-04-17T19:57Z 31.3K followers, 73.8K engagements "perf update: we are continuing to see benefits with vLLM V1 engines highly performant design. on 8xH200 vLLM leads in throughput for @deepseek_ai V3/R1 models. we expect further enhancements in collaboration with DeepSeeks inference engine open source plan" [X Link](https://x.com/vllm_project/status/1913004324613243166) 2025-04-17T22:59Z 31.2K followers, 42.1K engagements "After feedback about our v0.8.4 benchmark for @deepseek_ai R1 we rerun it with suggested changes: vLLM no EP SGLang updated v0.4.5 - post1 and EP - DP TensorRT-LLM uses overlap scheduler and tuned parameters. We are seeing good results So why was there a difference ๐งต" [X Link](https://x.com/vllm_project/status/1913513173342392596) 2025-04-19T08:41Z 31.4K followers, 80.2K engagements "OpenRLHF is a pioneering framework to use vLLM for RLHF driving many design and implementation of vLLM's features for RLHF making vLLM a popular choice for many RLHF frameworks. Learn more about the story at https://blog.vllm.ai/2025/04/23/openrlhf-vllm.html https://blog.vllm.ai/2025/04/23/openrlhf-vllm.html" [X Link](https://x.com/anyuser/status/1915307134256091570) 2025-04-24T07:29Z 31.2K followers, 48.4K engagements "looking amazing We found the quality of the documentation and diagram to be great and accurate. Will recommend :D https://deepwiki.com/vllm-project/vllm Project DeepWiki Up-to-date documentation you can talk to for every repo in the world. Think Deep Research for GitHub powered by Devin. Its free for open-source no sign-up Visit deepwiki com or just swap github deepwiki on any repo URL: https://t.co/5bHbvq98Ud https://deepwiki.com/vllm-project/vllm Project DeepWiki Up-to-date documentation you can talk to for every repo in the world. Think Deep Research for GitHub powered by Devin. Its free" [X Link](https://x.com/vllm_project/status/1915900439939453200) 2025-04-25T22:47Z 31.3K followers, 14K engagements "pip install -U vLLM vllm serve Qwen/Qwen3-235B-A22B-FP8 --enable-reasoning --reasoning-parser deepseek_r1 --tensor-parallel-size [--] vLLM introduce Day [--] support for @Alibaba_Qwen Qwen3 and Qwen3 MoE model architecture. Try it out: https://github.com/vllm-project/vllm/issues/17327 Introducing Qwen3 We release and open-weight Qwen3 our latest large language models including [--] MoE models and [--] dense models ranging from 0.6B to 235B. 
Our flagship model Qwen3-235B-A22B achieves competitive results in benchmark evaluations of coding math general https://t.co/JWZkJeHWhC" [X Link](https://x.com/vllm_project/status/1917008899410215275) 2025-04-29T00:11Z 31.3K followers, 60.3K engagements "Very cool spec decode technique built on top of @vllm_project We love the idea of suffix decoding ๐ Excited to share our work on Speculative Decoding @Snowflake AI Research ๐ 4x faster LLM inference for coding agents like OpenHands @allhands_ai ๐ฌ 2.4x faster LLM inference for interactive chat ๐ป Open-source via Arctic Inference as a plugin for @vllm_project ๐งต https://t.co/cdTuI7b9Yr Excited to share our work on Speculative Decoding @Snowflake AI Research ๐ 4x faster LLM inference for coding agents like OpenHands @allhands_ai ๐ฌ 2.4x faster LLM inference for interactive chat ๐ป" [X Link](https://x.com/vllm_project/status/1918367999825985651) 2025-05-02T18:12Z 31.3K followers, [----] engagements "Great work We love how @vllm_project is used in the rollout process with with offloading the engine to CPU and give the GPU back to the kernel to be benchmarked This is a small feature we implemented to make RLHF smoother with vLLM. Our research interns present: Kevin-32B = K(ernel D)evin It's the first open model trained using RL for writing CUDA kernels. We implemented multi-turn RL using GRPO (based on QwQ-32B) on the KernelBench dataset. It outperforms top reasoning models (o3 & o4-mini) ๐งต https://t.co/I3UXLGKFNb Our research interns present: Kevin-32B = K(ernel D)evin It's the first" [X Link](https://x.com/vllm_project/status/1919931593252053462) 2025-05-07T01:45Z 31.3K followers, 12.9K engagements ""inference uses @vllm_project" Releasing INTELLECT-2: Were open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: Detailed Technical Report INTELLECT-2 model checkpoint https://t.co/iHDDHRyKN2 Releasing INTELLECT-2: Were open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: Detailed Technical Report INTELLECT-2 model checkpoint https://t.co/iHDDHRyKN2" [X Link](https://x.com/vllm_project/status/1921802308129542551) 2025-05-12T05:39Z 31.2K followers, 10.7K engagements "uv pip install -U vLLM The latest release features [---] commits from [---] contributors. vLLM is now ready for @nvidia Blackwell with the latest @PyTorch [---] upgrade. 
Huge thanks to @NVIDIAAIDev and @ye_combinator for the CUTLASS and FlashInfer kernels" [X Link](https://x.com/vllm_project/status/1927543954167218317) 2025-05-28T01:54Z 31.2K followers, 17.7K engagements "Congrats on the launch vLLM is proud to support the new Qwen3 embedding models check it out ๐๐ป https://github.com/QwenLM/Qwen3-Embeddingtab=readme-ov-file#vllm-usage ๐ Proud to introduce the Qwen3-Embedding and Qwen3-Reranker Series setting new standards in multilingual text embedding and relevance ranking โจ Highlights: โ Available in 0.6B / 4B / 8B versions โ Supports [---] languages โ State-of-the-Art performance on MMTEB MTEB https://t.co/qNu0rswSol https://github.com/QwenLM/Qwen3-Embeddingtab=readme-ov-file#vllm-usage ๐ Proud to introduce the Qwen3-Embedding and Qwen3-Reranker Series" [X Link](https://x.com/vllm_project/status/1930652197882351702) 2025-06-05T15:45Z 31.3K followers, 12.2K engagements "uv pip install -U vllm --extra-index-url --torch-backend=auto Try out Magistral on with vLLM 0.9.1rc1 today ๐ฎ https://wheels.vllm.ai/0.9.1rc1 Announcing Magistral our first reasoning model designed to excel in domain-specific transparent and multilingual reasoning. https://t.co/SwKEEtCIXh https://wheels.vllm.ai/0.9.1rc1 Announcing Magistral our first reasoning model designed to excel in domain-specific transparent and multilingual reasoning. https://t.co/SwKEEtCIXh" [X Link](https://x.com/vllm_project/status/1932447031542608118) 2025-06-10T14:37Z 31.2K followers, 11.7K engagements "๐ Look what just arrived at @BerkeleySky ๐ A shiny MI355X system. Huge thanks to @AMD for supporting open source and we are looking forward to getting it set up in the next few days" [X Link](https://x.com/anyuser/status/1932487956159475896) 2025-06-10T17:20Z 31.3K followers, 20.7K engagements "Thank you @AMD @LisaSu @AnushElangovan for Advancing AI together with @vllm_project We look forward to the continued partnership and pushing the boundary of inference" [X Link](https://x.com/anyuser/status/1933674276659576921) 2025-06-13T23:54Z 31.2K followers, 17.3K engagements "Congrats on the launch vLLM is proud to support this great model on day [--] looking forward to the following releases Day 1/5 of #MiniMaxWeek: Were open-sourcing MiniMax-M1 our latest LLM setting new standards in long-context reasoning. - Worlds longest context window: 1M-token input 80k-token output - State-of-the-art agentic use among open-source models - RL at unmatched efficiency: https://t.co/bGfDlZA54n Day 1/5 of #MiniMaxWeek: Were open-sourcing MiniMax-M1 our latest LLM setting new standards in long-context reasoning. 
- Worlds longest context window: 1M-token input 80k-token output -" [X Link](https://x.com/vllm_project/status/1934645728103813433) 2025-06-16T16:14Z 31.3K followers, 19K engagements "vLLM has just reached 50K github stars Huge thanks to the community๐ Together let's bring easy fast and cheap LLM serving for everyoneโ๐ป" [X Link](https://x.com/vllm_project/status/1935569537858183321) 2025-06-19T05:25Z 31.2K followers, 18.3K engagements ""The second most intelligent open weights model after DeepSeek R1 with a much longer 1M token context window" Checkout the blog post from @minimax_ai on how the model is implemented on vLLM and how you can run this model efficiently https://blog.vllm.ai/2025/06/30/minimax-m1.html MiniMax launches their first reasoning model: MiniMax M1 the second most intelligent open weights model after DeepSeek R1 with a much longer 1M token context window @minimax_ai M1 is based on their Text-01 model (released [--] Jan 2025) - an MoE with 456B total and 45.9B active https://t.co/JltMYrm0te" [X Link](https://x.com/vllm_project/status/1939900796310888587) 2025-07-01T04:16Z 31.2K followers, 13.9K engagements "We genuinely want to solve this problem. As many (@Rxday000 @samsja19 @danielhanchen @_EldarKurtic and more) chimed in the reason includes attention kernels matmul reduction order precisions in various operators and more horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh" [X Link](https://x.com/vllm_project/status/1941973029623632290) 2025-07-06T21:30Z 31.2K followers, 37.3K engagements "vLLM runs on free-threaded Python A group of engineers from @Metas Python runtime language team has shown that its possible to run vLLM on the nogil distribution of Python. Were incredibly excited to embrace this future technique and be early adopters ๐" [X Link](https://x.com/vllm_project/status/1942450223881605593) 2025-07-08T05:06Z 31.2K followers, 50.6K engagements "@Kimi_Moonshot just released a trillion-parameter model with great agentic capability and it is already supported in vLLM Have a try with a simple command and check the doc for more advanced deployment๐ ๐ Hello Kimi K2 Open-Source Agentic Model ๐น 1T total / 32B active MoE model ๐น SOTA on SWE Bench Verified Tau2 & AceBench among open models ๐นStrong in coding and agentic tasks ๐ค Multimodal & thought-mode not supported for now With Kimi K2 advanced agentic intelligence https://t.co/PlRQNrg9JL ๐ Hello Kimi K2 Open-Source Agentic Model ๐น 1T total / 32B active MoE model ๐น SOTA on SWE Bench" [X Link](https://x.com/vllm_project/status/1943688429205860584) 2025-07-11T15:06Z 31.4K followers, 34.4K engagements "Thanks for the great write-up ๐ Prefix caching is critical for agentic workflows like @manusai and vLLM makes it seamless. 
โ prefix caching is enabled by default with an efficient implementation โ Append-only context Cache hit heaven Context engineering FTW ๐ After four overhauls and millions of real-world sessions here are the lessons we learned about context engineering for AI agents: https://t.co/Ql014rEzBQ After four overhauls and millions of real-world sessions here are the lessons we learned about context engineering for AI agents: https://t.co/Ql014rEzBQ" [X Link](https://x.com/vllm_project/status/1946575947295322171) 2025-07-19T14:20Z 31.2K followers, 15K engagements "The @huggingface Transformers @vllm_project integration just leveled up: Vision-Language Models are now supported out of the box If the model is integrated into Transformers you can now run it directly with vLLM. Great work @RTurganbay ๐ https://github.com/vllm-project/vllm/pull/20543 https://github.com/vllm-project/vllm/pull/20543" [X Link](https://x.com/anyuser/status/1947756551663718754) 2025-07-22T20:32Z 31.2K followers, 22.3K engagements "โ Try out @Alibaba_Qwen [--] Coder on vLLM nightly with "qwen3_coder" tool call parser Additionally vLLM offers expert parallelism so you can run this model in flexible configurations where it fits. Qwen3-Coder is here โ Were releasing Qwen3-Coder-480B-A35B-Instruct our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves https://t.co/Z8HfyrVScE Qwen3-Coder is here โ Were releasing Qwen3-Coder-480B-A35B-Instruct our most powerful open agentic code model to" [X Link](https://x.com/vllm_project/status/1947780382847603053) 2025-07-22T22:06Z 31.2K followers, 34.2K engagements "This amazing Attention-FFN disaggregation implementation from @StepFun_ai achieves decoding throughput of up to [----] tokens per second per GPU under 50ms TPOT SLA for their 321B-A38B MoE model Step3 served with H800 The implementation is based on vLLM and we are working together to bring it to the public Kudos to @StepFun_ai ๐ Check out their tech report at . https://github.com/stepfun-ai/Step3 https://github.com/stepfun-ai/Step3" [X Link](https://x.com/vllm_project/status/1948917913895010751) 2025-07-26T01:26Z 31.2K followers, 33.6K engagements "๐ฅ vLLM @ PyTorch Conference [----] ๐ฅ Were excited to share that [--] talks at this years PyTorch Conference will feature vLLM Topics include: Easy & Fast LLM Serving Open-Source Post-Training Stack Scaling Online LLM Training AMD GPU support via Triton vllm-triton backend performance Stay tuned & come say hi ๐ #vLLM #PyTorch #LLM #AI #opensource ๐ ICYMI: The #PyTorchConf schedule is now live https://t.co/YSAdVaiWRk Were talking [--] days of cutting-edge talks on #LLM scaling real-time inference model optimization & more straight from the #PyTorch community. ๐ Oct [----] San Francisco ๐ Register:" [X Link](https://x.com/vllm_project/status/1950821700679192654) 2025-07-31T07:31Z 31.2K followers, 29.1K engagements "Thank you @OpenAI for open-sourcing these great models ๐ Were proud to be the official launch partner for gpt-oss (20B & 120B) now supported in vLLM ๐ โก MXFP4 quant = fast & efficient ๐ Hybrid attention (sliding + full) ๐ค Strong agentic abilities ๐ Easy deployment ๐๐ป Check out the blog and recipes for more details ๐ฅ #vLLM #gptOSS #openai https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html https://blog.vllm.ai/2025/08/05/gpt-oss.html Our open models are here. Both of them. 
https://t.co/9tFxefOXcg https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html" [X Link](https://x.com/vllm_project/status/1952784530466849091) 2025-08-05T17:31Z 31.2K followers, 44.3K engagements "๐ we care a lot about correctness ran many evals and stared at many tensors to compare them. numerics of vLLM on hopper should be solid and verified if you run into any correctness issue on vLLM we would love to know and debug them Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across providers and runtimes right now due to implementation differences. Were working with inference providers to make sure gpt-oss performs at its best everywhere and wed love your feedback Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across" [X Link](https://x.com/vllm_project/status/1952940603773468926) 2025-08-06T03:51Z 31.2K followers, 66.7K engagements "Have you ever felt you are developing cuda kernels and your tests often run into illegal memory access (IMA for short) and you have no idea how to debug We have collaborated with the @nvidia team to investigate how cuda core dump can help check out the blogpost to learn more https://blog.vllm.ai/2025/08/11/cuda-debugging.html https://blog.vllm.ai/2025/08/11/cuda-debugging.html" [X Link](https://x.com/vllm_project/status/1955478388178817298) 2025-08-13T03:55Z 31.4K followers, 34K engagements "๐ Amazing community project vLLM CLI a command-line tool for serving LLMs with vLLM: โ Interactive menu-driven UI & scripting-friendly CLI โ Local + HuggingFace Hub model management โ Config profiles for perf/memory tuning โ Real-time server & GPU monitoring โ Error logs & recovery ๐ฆ Install in one line: pip install vllm-cli GitHub: ๐ Would you like to see these features in vLLM itself Try it out & share feedback https://github.com/Chen-zexi/vllm-cli https://github.com/Chen-zexi/vllm-cli" [X Link](https://x.com/vllm_project/status/1957002590220431669) 2025-08-17T08:52Z 31.2K followers, 72.8K engagements "๐ GLM-4.5 meets vLLM @Zai_org 's latest GLM-4.5 & GLM-4.5V models bring hybrid reasoning coding & intelligent agent capabilitiesnow fully supported in vLLM for fast efficient inference on NVIDIA Blackwell & Hopper GPUs Read more ๐ https://blog.vllm.ai/2025/08/19/glm45-vllm.html https://blog.vllm.ai/2025/08/19/glm45-vllm.html" [X Link](https://x.com/vllm_project/status/1957731795887353895) 2025-08-19T09:10Z 31.2K followers, 16.9K engagements "๐ Exciting news: DeepSeek-V3.1 from @deepseek_ai now runs on vLLM ๐ง Seamlessly toggle Think / Non-Think mode per request โก Powered by vLLMs efficient serving scale to multi-GPU with ease ๐ Perfect for agents tools and fast reasoning workloads ๐ Guide & examples: https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_1.html Introducing DeepSeek-V3.1: our first step toward the agent era ๐ ๐ง Hybrid inference: Think & Non-Think one model two modes โก Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs. 
DeepSeek-R1-0528 ๐ Stronger agent skills: Post-training" [X Link](https://x.com/vllm_project/status/1958580047658491947) 2025-08-21T17:20Z 31.2K followers, 23.6K engagements "Wow glad to see vLLM powers @jiawzhao 's DeepConf work impressive results on AIME [----] Do you think this sampling control makes sense Have a try and leave a comment in that PR to let us know https://github.com/vllm-project/vllm/pull/23201 Introducing DeepConf: Deep Think with Confidence ๐ First method to achieve 99.9% on AIME [----] with open-source models Using GPT-OSS-120B even without tools we reached this almost-perfect accuracy while saving up to 85% generated tokens. It also delivers many strong https://t.co/MlhDUKmawH https://github.com/vllm-project/vllm/pull/23201 Introducing DeepConf:" [X Link](https://x.com/vllm_project/status/1959277423729500565) 2025-08-23T15:31Z 31.3K followers, 48.3K engagements "๐ vLLM Shanghai Meetup Recap ๐ Last weekend we gathered with the community in Shanghai to dive into: Contributing to vLLM Distributed inference ERNIE [---] integration Mooncake + LMCache MetaX hardware support The community is pushing vLLM to new levels of performance scalability & adaptability. ๐ Event notes: ๐ Slides: https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg" [X Link](https://x.com/vllm_project/status/1959903380006175194) 2025-08-25T08:59Z 31.2K followers, 12.1K engagements "๐ LLM Compressor v0.7.0 is here This release brings powerful new features for quantizing large language models including transform support (QuIP SpinQuant) mixed precision compression improved MoE handling with Llama4 support and more. Full blog: More info below ๐ https://developers.redhat.com/articles/2025/08/25/llm-compressor-070-release-recap https://developers.redhat.com/articles/2025/08/25/llm-compressor-070-release-recap" [X Link](https://x.com/vllm_project/status/1960432740672921934) 2025-08-26T20:02Z 31.2K followers, 16.8K engagements "๐ vLLM now supports Kwai Keye-VL-1.5 With sharper video ๐น & image ๐ผ comprehension stronger reasoning and an extended 128K context length this model unlocks richer conversations and more complex tasks than ever before. Upgrade to the nightly build and experience it today Check out for more details. #AI #vLLM #KeyeVL https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B" [X Link](https://x.com/vllm_project/status/1962509793345859666) 2025-09-01T13:36Z 31.2K followers, 10.7K engagements "Amazing blogpost from @gordic_aleksa explaining internals of vLLM๐ New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of understanding of the codebase and then to write up https://t.co/F2wsYaFO7q New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". 
Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of" [X Link](https://x.com/vllm_project/status/1962547561698652499) 2025-09-01T16:06Z 31.4K followers, 32.8K engagements "vLLM is proud to support the great Kimi update from @Kimi_Moonshot better tool-calling longer context and more Check the deployment guide at ๐ฅ https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/docs/deploy_guidance.md Kimi K2-0905 update ๐ - Enhanced coding capabilities esp. front-end & tool-calling - Context length extended to 256k tokens - Improved integration with various agent scaffolds (e.g. Claude Code Roo Code etc) ๐ Weights & code: https://t.co/83sQekosr9 ๐ฌ Chat with new Kimi https://t.co/mkOuBMwzpw" [X Link](https://x.com/vllm_project/status/1963805972352188895) 2025-09-05T03:26Z 31.4K followers, 18.6K engagements "The amazing blogpost from @gordic_aleksa is alive at vLLM's blogpost (after more proofreading and clarifications) Looking forward to future series of tech deep dive blogposts๐ https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of understanding of the codebase and then to write up https://t.co/F2wsYaFO7q https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html New in-depth blog" [X Link](https://x.com/vllm_project/status/1965226648560935126) 2025-09-09T01:31Z 31.3K followers, 48.2K engagements "Wow thanks to @charles_irl you can understand internals of vLLM with a live notebook from @modal ๐ฅฐ I had already planned to spend the day reading @gordic_aleksa's "Inside vLLM" blog post. That turned out to be an incredible fit for @modal Notebooks released today https://t.co/QKX1g9smdp https://t.co/npjf7yrljM I had already planned to spend the day reading @gordic_aleksa's "Inside vLLM" blog post. That turned out to be an incredible fit for @modal Notebooks released today https://t.co/QKX1g9smdp https://t.co/npjf7yrljM" [X Link](https://x.com/anyuser/status/1965708611222684017) 2025-09-10T09:27Z 31.2K followers, 30.7K engagements "โก Efficient weight updates for RL at trillion-parameter scale ๐ก Best practice from Kimi @Kimi_Moonshot vLLM is proud to collaborate with checkpoint-engine: Broadcast weight sync for 1T params in 20s across 1000s of GPUs Dynamic P2P updates for elastic clusters Optimized pipeline w/ overlapped H2D broadcast & reload Open source & ready for large-scale RL with vLLM ๐ Introducing checkpoint-engine: our open-source lightweight middleware for efficient in-place weight updates in LLM inference engines especially effective for RL. โ Update a 1T model on thousands of GPUs in 20s โ Supports both" [X Link](https://x.com/vllm_project/status/1965824120920342916) 2025-09-10T17:06Z 31.2K followers, 24K engagements "Thank you @cHHillee for the great explanation and demo of how to implement deterministic inference on vLLM https://github.com/vllm-project/vllm/pull/24583 Today Thinking Machines Lab is launching our research blog Connectionism. Our first blog post is Defeating Nondeterminism in LLM Inference We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to https://t.co/jMFL3xt67C https://github.com/vllm-project/vllm/pull/24583 Today Thinking Machines Lab is launching our research blog Connectionism. 
Our first blog post is" [X Link](https://x.com/vllm_project/status/1965832975297503662) 2025-09-10T17:41Z 31.2K followers, 16.3K engagements "Deep dive into optimizing weight transfer step by step and improving it 60x [---] seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT https://t.co/PAUqY43epH [---] seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT https://t.co/PAUqY43epH" [X Link](https://x.com/vllm_project/status/1965879465688576110) 2025-09-10T20:46Z 31.3K followers, 18.5K engagements "Welcome Qwen3-Next You can run it efficiently on vLLM with accelerated kernels and native memory management for hybrid models. https://blog.vllm.ai/2025/09/11/qwen3-next.html ๐ Introducing Qwen3-Next-80B-A3B the FUTURE of efficient LLMs is here ๐น 80B params but only 3B activated per token 10x cheaper training 10x faster inference than Qwen3-32B.(esp. @ 32K+ context) ๐นHybridArchitecture:GatedDeltaNet+GatedAttentionbestofspeed& https://t.co/yO7ug721U6 https://blog.vllm.ai/2025/09/11/qwen3-next.html ๐ Introducing Qwen3-Next-80B-A3B the FUTURE of efficient LLMs is here ๐น 80B params but only" [X Link](https://x.com/vllm_project/status/1966224816777928960) 2025-09-11T19:38Z 31.4K followers, 61.7K engagements "v0.10.2 marks the first release with official aarch64 support for vLLM You can now install vLLM directly onto @nvidia 's GB200. Along with the PyPI release our docker image is also multi-platform so pulling the right image just works. More perf enhancements on the way" [X Link](https://x.com/vllm_project/status/1967752683458269282) 2025-09-16T00:49Z 31.3K followers, 17.9K engagements "Congrats to @deepseek_ai DeepSeek-R1 was published in Nature yesterday as the cover article and vLLM is proud to have supported its RL training and inference๐ฅฐ" [X Link](https://x.com/anyuser/status/1968506474709270844) 2025-09-18T02:44Z 31.2K followers, 214.3K engagements "Pro Tip๐กFast and simple way to deploy DeepSeek-V3.1-Terminus with vLLM โก Run it with: vllm serve deepseek-ai/DeepSeek-V3.1-Terminus -tp [--] -dcp [--] (as simple as appending -dcp [--] after -tp 8) Thanks to the @Kimi_Moonshot team vLLM 0.10.2 adds Decode Context Parallel (DCP) support: โ Cuts KV cache duplication by sharding across GPUs โ [--] larger KV cache โ [--] throughput gain on single-node H200 Perfect for KV-cache hungry tasks (RL offline data generation). More blogposts diving into DCP are coming soonand optimizations for general GQA models are on the way ๐ #vLLM #DeepSeek #AIInfra ๐" [X Link](https://x.com/vllm_project/status/1970814441718755685) 2025-09-24T11:35Z 31.3K followers, 44K engagements "๐ New in vLLM: dots.ocr ๐ฅ A powerful multilingual OCR model from @xiaohongshu hi lab is now officially supported in vLLM ๐ Single end-to-end parser for text tables (HTML) formulas (LaTeX) and layouts (Markdown) ๐ Supports [---] languages with robust performance on low-resource docs โก Compact 1.7B VLM but achieves SOTA results on OmniDocBench & dots.ocr-bench โ Free for commercial use Deploy it in just two steps: uv pip install vllm --extra-index-url vllm serve rednote-hilab/dots.ocr --trust-remote-code Try it today and bring fast accurate OCR to your pipelines. 
Which models would you like" [X Link](https://x.com/vllm_project/status/1972275216954073498) 2025-09-28T12:20Z 31.4K followers, 70.6K engagements "How does @deepseek_ai Sparse Attention (DSA) work It has [--] components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of [---] per token (vs. [---] for MLA). It scores incoming queries. The top-2048 tokens to pass to Sparse MLA. ๐ Introducing DeepSeek-V3.2-Exp our latest experimental model โจ Built on V3.1-Terminus it debuts DeepSeek Sparse Attention(DSA) for faster more efficient training & inference on long context. ๐ Now live on App Web and API. ๐ฐ API prices cut by 50%+ 1/n ๐ Introducing DeepSeek-V3.2-Exp our latest experimental model โจ" [X Link](https://x.com/vllm_project/status/1972617272901644345) 2025-09-29T10:59Z 31.2K followers, 103.3K engagements "Getting ready to try DeepSeek-V3.2-Exp from @deepseek_ai vLLM is here to help We have verified that it works on H200 machines and many other hardwares thanks to the hardware plugin mechanism. Check out the recipes for more details ๐ Note: currently the PR is still pending but you can use our pre-compiled wheels to build directly and use the model We will push the model into main branch very soon and add many more optimizations. Stay tuned https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html How does @deepseek_ai Sparse Attention (DSA) work It has [--] components: the" [X Link](https://x.com/vllm_project/status/1972664010702221399) 2025-09-29T14:05Z 31.2K followers, 12.5K engagements "Keeping BERT alive in vLLM via transformers Today I can finally announce that the Transformers backend for vLLM will be official way to use encoder-only (think BERT et al.) models in vLLM going forward (1/2) https://t.co/uIlGy2loCE Today I can finally announce that the Transformers backend for vLLM will be official way to use encoder-only (think BERT et al.) models in vLLM going forward (1/2) https://t.co/uIlGy2loCE" [X Link](https://x.com/vllm_project/status/1973805307878142297) 2025-10-02T17:40Z 31.3K followers, 17.8K engagements "๐ The RL community keeps pushing boundaries from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild but thats exactly what PipelineRL makes work. vLLM is proud to power this kind of modular cutting-edge RL innovation. Give it a try and share your thoughts I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences Just update the weights and" [X Link](https://x.com/vllm_project/status/1974732295627301254) 2025-10-05T07:04Z 31.4K followers, 89K engagements "๐ vLLM x MinerU: Document Parsing at Lightning Speed Were excited to see MinerU fully powered by vLLM bringing ultra-fast accurate and efficient document understanding to everyone. 
โก Powered by vLLMs high-throughput inference engine MinerU [---] delivers: Instant parsing no waiting Deeper understanding for complex docs Optimized cost even consumer GPUs can fly Experience the new speed of intelligence: ๐ #vLLM #MinerU #AI #LLM #DocumentParsing #AIresearch https://github.com/opendatalab/MinerU MinerU [---] has arrived with demo on @huggingface https://t.co/KZyL6TJDSe" [X Link](https://x.com/vllm_project/status/1976968858415005841) 2025-10-11T11:11Z 31.2K followers, 71.7K engagements "๐ vLLM just hit 60K GitHub stars ๐ From a small research idea to powering LLM inference everywhere across NVIDIA AMD Intel Apple TPUs and more vLLM now supports almost all major text-generation models and native RL pipelines like TRL Unsloth Verl and OpenRLHF. Huge thanks to our amazing community and friends at @PyTorch @huggingface Transformers and model vendors from @AIatMeta Llama to @OpenAI GPT-OSS @Alibaba_Qwen Qwen @deepseek_ai DeepSeek and @Kimi_Moonshot Kimi and many others (sorry we ran out of space) for making this ecosystem thrive. โค Let's head to the next chapter of efficient" [X Link](https://x.com/vllm_project/status/1977724334157463748) 2025-10-13T13:13Z 31.2K followers, 39.1K engagements "Announcing the completely reimagined vLLM TPU In collaboration with @Google we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. ๐ What's New - JAX + Pytorch: Run PyTorch models on TPUs with no code changes now with native JAX support. - Up to 5x Performance: Achieve nearly 2x-5x higher throughput compared to the first TPU prototype. - Ragged Paged Attention v3: A more flexible and performant attention kernel for TPUs. - SPMD Native: We've shifted to Single Program Multi-Data (SPMD) as the default a" [X Link](https://x.com/vllm_project/status/1978855648176853100) 2025-10-16T16:08Z 31.2K followers, 157.8K engagements "kvcached works directly with vLLM and you can use it to serve multiple models on the same GPU. They will share unused KV cache blocks. Check it out ๐ End the GPU Cost Crisis Today Headache with LLMs lock a whole GPU but leave capacity idle Frustrated by your cluster's low utilization We launch kvcached the first library for elastic GPU sharing across LLMs. ๐ https://t.co/3BC7B6s2EX ๐งต๐ Why it matters: https://t.co/jdIg1gyyOS ๐ End the GPU Cost Crisis Today Headache with LLMs lock a whole GPU but leave capacity idle Frustrated by your cluster's low utilization We launch kvcached the first" [X Link](https://x.com/vllm_project/status/1980776841129701411) 2025-10-21T23:22Z 31.4K followers, 55.7K engagements "its tokenization again ๐คฏ did you know tokenize(detokenize(token_ids)) token_ids RL researchers from Agent Lightning coined the term Retokenization Drift a subtle mismatch between what your model generated and what your trainer thinks it generated. why because most agents call LLMs via OpenAI-compatible APIs that only return strings so when those strings get retokenized later token splits may differ (HAV+ING vs H+AVING) tool-call JSON may be reformatted or chat templates may vary. unstable learning off-policy updates training chaos. 
๐ฌ (@karpathy has a great video explaining all details about" [X Link](https://x.com/vllm_project/status/1981017184769061153) 2025-10-22T15:17Z 31.2K followers, 171K engagements "๐ Excited to share our work on batch-invariant inference in vLLM Now you can get identical results regardless of batch size with just one flag: VLLM_BATCH_INVARIANT=1 No more subtle differences between bs=1 and bs=N (including prefill). Let's dive into how we built this ๐งต๐" [X Link](https://x.com/vllm_project/status/1981088861506982041) 2025-10-22T20:02Z 31.2K followers, 38.8K engagements "๐@Kimi_Moonshot co-founder @ppwwyyxx talking about Moonshots Decode Context Parallel open source contribution to @vllm_project at @PyTorch conf" [X Link](https://x.com/vllm_project/status/1981516101189324986) 2025-10-24T00:20Z 31.3K followers, 17.1K engagements "vLLM ๐ค @nvidia = open scalable agentic AI you can run anywhere. ๐งต Strengthening our partnership with @nvidia: vLLM serves the NVIDIA Nemotron family. This new blog from NVIDIA walks through how to deploy open high-accuracy agentic inference across data center and edgefast reproducible and production-ready. ๐ Model highlight Nemotron Nano [--] (9B): a small language reasoning model with a hybrid TransformerMamba design and a tunable thinking budget. Open weights + 9T tokens of open data on Hugging Face (permissive license). Excels at reasoning/coding instruction following tool calling and" [X Link](https://x.com/vllm_project/status/1981553870599049286) 2025-10-24T02:50Z 31.2K followers, 19.2K engagements
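For the batch-invariant inference flag quoted above (VLLM_BATCH_INVARIANT=1), a minimal offline sketch looks like the following. Only the flag itself comes from the post; the model name and prompts are placeholders, and the exact guarantees are as described in the post.

```python
import os

# Per the post above, setting this flag before engine start requests
# identical outputs regardless of how requests are batched.
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model; any vLLM-supported checkpoint can be substituted.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = "Explain paged attention in one sentence."

# Same prompt run alone (batch size 1) and inside a larger batch.
solo = llm.generate([prompt], params)[0].outputs[0].text
batched = llm.generate([prompt] + [f"filler {i}" for i in range(7)], params)[0].outputs[0].text

# With batch invariance enabled, the two completions should match exactly.
print(solo == batched)
```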
Top posts by engagements in the last [--] hours
"One of the best part of vLLM is our amazing community Our issues and pull requests are ever increasing. We have members of the community helping closing stale issues. @hmellor_ helped closing more than [---] stales issues over the last month"
X Link 2024-04-08T18:48Z [---] followers, [---] engagements
"Speculative Decoding improves the end-to-end latency by using draft models or ngrams to propose multiple tokens before the large model verifies it. https://docs.vllm.ai/en/stable/models/spec_decode.html https://docs.vllm.ai/en/stable/models/spec_decode.html"
X Link 2024-06-14T00:51Z [----] followers, [---] engagements
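The speculative decoding setup described in the linked docs can be sketched with vLLM's offline API. The draft/target model pair below is illustrative, and the keyword arguments follow the interface documented around the time of this post; newer releases group these knobs under a single speculative config, so check the current docs before copying.

```python
from vllm import LLM, SamplingParams

# Sketch of draft-model speculative decoding: a small draft model proposes
# several tokens per step and the large target model verifies them.
llm = LLM(
    model="facebook/opt-6.7b",               # target (verifier) model, illustrative
    speculative_model="facebook/opt-125m",   # small draft model, illustrative
    num_speculative_tokens=5,                # tokens proposed per verification step
)

outputs = llm.generate(["The future of LLM inference is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```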
"Multimodal is also getting more important vLLM now has OpenAI compatible API for vision models. While only LLaVA and LLaVA-NeXT is supported there are 5+ in-flight PRs adding new models. https://docs.vllm.ai/en/stable/models/vlm.html https://docs.vllm.ai/en/stable/models/vlm.html"
X Link 2024-06-14T00:51Z [----] followers, [---] engagements
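Because the server exposes an OpenAI-compatible API, a vision request can be sent with the standard openai client. This assumes a local server started with a supported vision model such as LLaVA (mentioned in the post); the port, model name, and image URL are placeholders.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # must match the model the server is running
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```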
"Join us on Monday September [--] at Fort Mason to discuss recent performance enhancements in vLLM. This is a collaboration with NVIDIA Triton team. https://lu.ma/87q3nvnh https://lu.ma/87q3nvnh"
X Link 2024-08-08T18:32Z [----] followers, [----] engagements
"Beyond the dedicated vLLM track at #RaySummit we are co-hosting a meet-and-greet with @AMD and @anyscalecompute on the training day Sept 30th to kick off the summit. This will be the first event that you can learn about AMD MI300X's performance on vLLM https://lu.ma/db5ld9n5 https://lu.ma/db5ld9n5 https://lu.ma/db5ld9n5 https://lu.ma/db5ld9n5"
X Link 2024-09-16T19:03Z [----] followers, [----] engagements
"@Roblox not only adopts vLLM to innovate AI enabled product but also actively contributse to the open source ecosystemโค "We have adopted vLLM as our primary inference engine for LLMs leveraging vLLMs high-performance capabilities to power AI applications across Roblox. Since moving to vLLM weve seen an almost 2x improvement in both latency and throughput and we currently serve approximately [--] billion tokens per week." "Our choice of vLLM aligns with our commitment to leveraging open-source and cutting-edge technologies that can scale efficiently to meet the demands of our vast user base and"
X Link 2024-09-18T01:21Z [----] followers, [----] engagements
"Join the vLLM team at #RaySummit (Oct [--] 2) for the first ever "The State of vLLM 2024" talk recap our progress over the last year and looking forward to the future by @zhuohan123 and @KuntaiDu . This talk will kick start the track of vLLM talks from @anyscalecompute @Roblox @neuralmagic @IBMResearch @Apple @Uber @intel @databricks and others Registration link: (use code AnyscalevLLM15 for 15% off) http://raysummit.anyscale.com http://raysummit.anyscale.com"
X Link 2024-09-20T20:57Z [----] followers, [----] engagements
"We are utilizing PyTorch as a narrow waist for hardware abstractions"
X Link 2024-09-20T22:33Z [----] followers, [---] engagements
"Currently both AMD GPU and Google TPU utilize PyTorch's native ops and custom backends"
X Link 2024-09-20T22:33Z [----] followers, [---] engagements
"vLLM contributor @KaichaoYou will speak at IBM TechXchange on Oct [--] in Las Vegas come and discuss anything about vLLM"
X Link 2024-09-25T03:01Z [----] followers, [----] engagements
"๐ We just created a vLLM developer Slack workspace to discuss features coordinate collaborations and bring the users community together. Looking forward to see you there http://slack.vllm.ai http://slack.vllm.ai"
X Link 2024-10-08T00:22Z [----] followers, [----] engagements
"๐คฉ checkout this blog from @awscloud about scaling Rufus which is powered by @vllm_project on Inferentia and Trainium Serving [--] million tokens a minute "Within each container an NVIDIA Triton Inference Server with a Python backend is used running vLLM with the Neuron SDK. vLLM is a memory-efficient inference and serving engine that is optimized for high throughput." "These choices allowed Rufus to scale up over [-----] Trainium and Inferentia chips across three Regions serving an average of [--] million tokens a minute while maintaining P99 less than [--] second latency to the first response for"
X Link 2024-10-10T19:16Z [----] followers, [----] engagements
"๐The amazing folks at @EmbeddedLLM and @HotAisle have performed extensive tuning and benchmarking of vLLM on @AMD's MI300X GPU showcase leading performance. With [---] GB HBM3 per GPU MI300X is a great choice for large models such as 405B. https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html"
X Link 2024-10-29T19:32Z 18.7K followers, [----] engagements
"We reached 30K GitHub stars For all the stargazers what do you want to see in vLLM in [----] ๐"
X Link 2024-11-15T02:46Z [----] followers, [----] engagements
"what would you like vLLM to build/fix in [----] :D"
X Link 2024-12-25T22:43Z [----] followers, [----] engagements
"The first vLLM meetup in [----] is happening in two weeks on January 22nd Wednesday with @googlecloud in SF We will talk about vLLM's performant V1 architecture Q1 roadmap Google Cloud's innovation around vLLM: networking Cloud Run Vertex and TPU https://lu.ma/zep56hui https://lu.ma/zep56hui https://lu.ma/zep56hui https://lu.ma/zep56hui"
X Link 2025-01-08T20:11Z [----] followers, [----] engagements
"We focus our benchmark on long generation workload. On 8xH200 we are seeing 40% increase in generation throughput from the optimized fp8 kernels and 3.4x enhancement from MLA. On TP8PP2 settings with H100 we see 26% increase from the fp8 kernels and 2.8x increase from MLA"
X Link 2025-02-01T23:46Z [----] followers, [----] engagements
"The throughput increase comes from memory savings. MLA offers 9.6x memory capacity for KV caches which increases the batch size. For example on 8xH200 we increased from [-----] tokens to [------] tokens which means batch size of [--] to 128"
X Link 2025-02-01T23:46Z [----] followers, [----] engagements
"@zjasper666 @deepseek_ai Here's the batch size [--] performance: https://x.com/vllm_project/status/1885837180960211349 However do note that MHA is faster for generation than MLA actually in low qps settings. This is a current limitation we are addressing. Take the two settings we have under a single request we see MHA is worse time-to-first-token (TTFT) but higher time-per-output-token (TPOT) https://t.co/LmZjLih1KJ https://x.com/vllm_project/status/1885837180960211349 However do note that MHA is faster for generation than MLA actually in low qps settings. This is a current limitation we are"
X Link 2025-02-02T05:44Z [----] followers, [---] engagements
"We are welcoming AIBrix to vLLM organization It is a battery-included vLLM Kubernetes serving stack developed by ByteDance. https://blog.vllm.ai/2025/02/21/aibrix-release.html https://blog.vllm.ai/2025/02/21/aibrix-release.html"
X Link 2025-02-21T23:55Z [----] followers, [----] engagements
"๐ Join us at the SF AIBrix & vLLM Meetup on June 18th at AWS SF GenAI Loft Learn from experts at ByteDance AWS Neuron and EKS. Discover AIBrix: a scalable cost-effective control plane for vLLM. Talks Q&A pizza and networking ๐๐ค https://lu.ma/ab2id296 https://lu.ma/ab2id296"
X Link 2025-06-03T17:25Z 14.2K followers, [----] engagements
"Cool to see @vllm_project used as part of @WhatsApp Trusted Execution Environment (TEE) Private Processing Paper here: https://ai.meta.com/static-resource/private-processing-technical-whitepaper https://ai.meta.com/static-resource/private-processing-technical-whitepaper"
X Link 2025-06-14T00:34Z 14.2K followers, [----] engagements
"Minimax M1 is one of the SOTA open weight model from @MiniMax__AI. Checkout how is it efficiently implemented in vLLM directly from the team https://blog.vllm.ai/2025/06/30/minimax-m1.html ๐ฅ Another strong open model with Apache [---] license this one from @MiniMax_AI - places in the top [--]. MiniMax-M1 is now live on the Text Arena leaderboard landing at #12. This puts it at equal ranking with Deepseek V3/R1 and Qwen [--] See thread to learn more about its https://t.co/j14gU5D37O https://blog.vllm.ai/2025/06/30/minimax-m1.html ๐ฅ Another strong open model with Apache [---] license this one from"
X Link 2025-07-01T04:17Z 14.6K followers, 19K engagements
"We started a channel in the vLLM Slack (join via and let's discuss tips and ways to address this http://slack.vllm.ai http://slack.vllm.ai"
X Link 2025-07-06T21:30Z 14.6K followers, [----] engagements
"Intern-S1 is supported in vLLM now thanks for the joint efforts between the vLLM team and the InternLM team @intern_lm โฅ The easy way: uv pip install vllm --extra-index-url vllm serve internlm/Intern-S1 --tensor-parallel-size [--] --trust-remote-code https://wheels.vllm.ai/nightly ๐Introducing Intern-S1 our most advanced open-source multimodal reasoning model yet ๐ฅณStrong general-task capabilities + SOTA performance on scientific tasks rivaling leading closed-source commercial models. ๐ฅฐBuilt upon a 235B MoE language model and a 6B Vision encoder. https://t.co/0htivKiv3k"
X Link 2025-07-26T16:37Z 15.4K followers, [----] engagements
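A minimal shell sketch of the install-and-serve flow described in the post above. The tensor-parallel size is redacted in the post, so `<NUM_GPUS>` below is a placeholder rather than a value from the source.

```bash
# Install vLLM from the nightly wheel index referenced in the post
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly

# Serve Intern-S1; <NUM_GPUS> is a placeholder for the redacted tensor-parallel size
vllm serve internlm/Intern-S1 \
  --tensor-parallel-size <NUM_GPUS> \
  --trust-remote-code
```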
"@QuixiAI @Kimi_Moonshot @huggingface @casper_hansen_ What's the problem you get on vllm Please file an issue to discuss"
X Link 2025-08-05T16:22Z 16K followers, [---] engagements
"Fun demo of gpt-oss on @vllm_project Want to see our open models in action Watch how gpt-oss builds a video gameusing tools step-by-step within chain-of-thought reasoning ๐พ๐ https://t.co/WNeV0cpwM2 Want to see our open models in action Watch how gpt-oss builds a video gameusing tools step-by-step within chain-of-thought reasoning ๐พ๐ https://t.co/WNeV0cpwM2"
X Link 2025-08-05T20:59Z 16K followers, [----] engagements
"@code_star The openai-harmony=0.1.0 should be hosted as part of wheel index this morning and it is a pre-release wheel. About an hour ago we also updated the wheel to use any version of openai-harmony available. https://wheels.vllm.ai/gpt-oss https://wheels.vllm.ai/gpt-oss"
X Link 2025-08-05T23:08Z 16.1K followers, [---] engagements
"@casper_hansen_ maybe you will be interested in verl team @verl_project already builds their RL pipeline on this token-in-token-out usage of vLLM. https://verl.readthedocs.io/en/latest/advance/agent_loop.html https://verl.readthedocs.io/en/latest/advance/agent_loop.html"
X Link 2025-08-10T10:37Z 16.2K followers, [----] engagements
"Join us for the vLLM Meetup in Singapore on Aug [--] [----] (Wednesday) 6-8:30 PM @SGInnovateWe will discuss efficient LLM inference with talks on: - @EmbeddedLLM on Latest vLLM advancements - @AMD on optimizing inference on Data Center GPUs - @WekaIO presenting vLLM + LMCache + SSD for high-performance KV cache - @ASTARsg MERaLiON team on deploying AudioLLM with vLLM+Ray for autoscaling and load balancing Followed by Q&A ๐ท networking. Spaces limitedRSVP at https://www.sginnovate.com/event/vllm-sg-meet https://www.sginnovate.com/event/vllm-sg-meet"
X Link 2025-08-11T04:54Z 16.6K followers, [----] engagements
"When you chat with Rufus on Amazon app it is powered by vLLM https://aws.amazon.com/blogs/machine-learning/how-amazon-scaled-rufus-by-building-multi-node-inference-using-aws-trainium-chips-and-vllm/ https://aws.amazon.com/blogs/machine-learning/how-amazon-scaled-rufus-by-building-multi-node-inference-using-aws-trainium-chips-and-vllm/"
X Link 2025-08-14T22:10Z 16.6K followers, [----] engagements
"We are co-hosting gpt-oss meetup in San Francisco next Wednesday 08/27 at 5:30pm together with @OpenAI and @ollama at @ycombinator office RSVP here https://lu.ma/gpt-oss https://lu.ma/gpt-oss"
X Link 2025-08-22T05:11Z 17.2K followers, [----] engagements
"@LinkedIn not only uses vLLM at massive scale but also actively contributes to the community checkout their wonderful blog: https://www.linkedin.com/blog/engineering/ai/how-we-leveraged-vllm-to-power-our-genai-applications https://www.linkedin.com/blog/engineering/ai/how-we-leveraged-vllm-to-power-our-genai-applications"
X Link 2025-08-26T20:07Z 16.8K followers, [--] engagements
"@casper_hansen_ @sgl_project There's a guide from the rednote team and we will work with team for upstreaming https://huggingface.co/rednote-hilab/dots.ocr#vllm-inference https://huggingface.co/rednote-hilab/dots.ocr#vllm-inference"
X Link 2025-09-10T17:56Z 17.6K followers, [---] engagements
"@thinkymachines is building a vLLM team You will be working with @woosuk_k and many great folks to build the worlds frontier inference engine At Thinking Machines our work includes collaborating with the broader research community. Today we are excited to share that we are building a vLLM team at @thinkymachines to advance open-source vLLM and serve frontier models. If you are interested please DM me or @barret_zoph At Thinking Machines our work includes collaborating with the broader research community. Today we are excited to share that we are building a vLLM team at @thinkymachines to"
X Link 2025-09-11T21:39Z 17.8K followers, [----] engagements
"Step [--] Top-k Selection For each query the indexer selects its top-2048 tokens in the context to attend to. (If context length [----] all tokens are included.)"
X Link 2025-09-29T10:59Z 18.9K followers, [----] engagements
"Step [--] Sparse MLA With selected token indices the FlashMLA sparse kernel performs efficient attention skipping irrelevant positions"
X Link 2025-09-29T10:59Z 18.9K followers, [----] engagements
"Thank you @NVIDIADC for supporting vLLM for @deepseek_ai model launch. Blackwell is now the go to release platform for new MoEs and we cannot do it without the amazing team from NVIDIA. ๐ฃ We partnered with @vllm_project to optimize DeepSeek-V3.2-Exp across our platform. @deepseek_ai's Sparse Attention uses lightning indexer to selectively attend to the most relevant 2K tokens enabling higher performance for long context use cases. vLLM the open source ๐ฃ We partnered with @vllm_project to optimize DeepSeek-V3.2-Exp across our platform. @deepseek_ai's Sparse Attention uses lightning indexer"
X Link 2025-10-01T04:09Z 19.6K followers, [----] engagements
"@CatGodSandHive check out their paper for more details"
X Link 2025-10-05T07:15Z 19K followers, [---] engagements
"@ScentScai @OpenAI @nvidia @huggingface @lmstudio @devpost congrats"
X Link 2025-10-14T01:47Z 19.6K followers, [---] engagements
"Announcing the completely reimagined vLLM TPU In collaboration with @Google we've launched a new high-performance TPU backend unifying PyTorch and JAX under a single lowering path for amazing performance and flexibility"
X Link 2025-10-16T15:54Z 19.6K followers, [--] engagements
"We hear your voice For minimax in particular forces the lm head to be fp32 which resumes accuracy but takes a lot memory. We are experimenting to see if dynamically casting fp16/bf16 to fp32 in the kernel helps the accuracy of the logits. https://github.com/vllm-project/vllm/pull/19592 horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh https://github.com/vllm-project/vllm/pull/19592 horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh"
X Link 2025-07-07T02:35Z 25.7K followers, [----] engagements
"@winglian Thank you for the investigation. We will bump the version immediately so you can use vLLM's nightly wheel with PyTorch 2.7.1 and it will be the default for next release"
X Link 2025-07-15T19:09Z 25.7K followers, [----] engagements
"๐ฅ Step [--] support just landed in vLLM Both the 321B text-only and vision-language models are now fully supported โ Step [--] is a blazing-fast cost-effective VLM with MFA & AFD for efficient inferenceeven on modest GPUs. ๐ Up to [----] tok/sec/GPU We will optimize it further in the futureโก ๐ Deployment guide ๐๐ป #vLLM #Step3 #OpenSourceAI #VLM https://huggingface.co/stepfun-ai/step3/blob/main/docs/deploy_guidance.md ๐ Announcing Step 3: Our latest open-source multimodal reasoning model is here Get ready for a stronger faster & more cost-effective VLM ๐ต 321B parameters (38B active) optimized"
X Link 2025-07-31T16:18Z 25.6K followers, [----] engagements
"With v0.7.0 you can apply multiple compressors to a single model. This enables mixing formats like NVFP4 and FP8 giving you fine-grained control over quantization strategies and better handling of sensitive layers. Block FP8 (DeepSeekV3-style) is also supported"
X Link 2025-08-26T20:02Z 28.3K followers, [----] engagements
"MoE calibration is now more flexible with support for NVFP4 quantized experts. Llama4 models can now be quantized and run with vLLM using GPTQ or NVFP4. Includes support for WN16. Full release notes ๐ท https://github.com/vllm-project/llm-compressor/releases/tag/0.7.0 https://github.com/vllm-project/llm-compressor/releases/tag/0.7.0"
X Link 2025-08-26T20:02Z 28.2K followers, [----] engagements
"Step [--] Query & Key Projection Hidden states are projected into query/key space (with rotary embeddings). A new addition: per-head weights are also projected from hidden statesthese will reweight logits later"
X Link 2025-09-29T10:59Z 26.6K followers, [----] engagements
"@sundeep @IBMwatsonx @GroqInc @IBM @ArvindKrishna @robdthomas congrats is this a hardware plugin that will be open-sourcedโค"
X Link 2025-10-20T15:47Z 23.2K followers, [----] engagements
"๐คฏ Self improving FlashInfer kernels We look forward to the integration and make it shine on real world workloads. ๐ค Can AI optimize the systems it runs on ๐ Introducing FlashInfer-Bench a workflow that makes AI systems self-improving with agents: - Standardized signature for LLM serving kernels - Implement kernels with your preferred language - Benchmark them against real-world serving https://t.co/FU9JjJ1vmf ๐ค Can AI optimize the systems it runs on ๐ Introducing FlashInfer-Bench a workflow that makes AI systems self-improving with agents: - Standardized signature for LLM serving kernels"
X Link 2025-10-21T19:44Z 25.3K followers, [----] engagements
"@bdambrosio yes it's supported you can pass in token ids directly in v1/completions endpoint"
X Link 2025-10-22T17:02Z 23.9K followers, [----] engagements
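To illustrate the reply above, a hedged sketch of passing pre-tokenized input to the OpenAI-compatible v1/completions endpoint of a locally running vLLM server; the host, port, served model name, and token IDs are placeholders, not values from the post.

```bash
# Send token IDs instead of a text prompt to a local vLLM server
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-name>",
        "prompt": [[1, 3087, 374, 264, 1296]],
        "max_tokens": 16
      }'
```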
"The probability sampled from the model should be exactly the same whether you're running a single request batching multiple requests together or prefilling with generated tokens"
X Link 2025-10-22T20:02Z 24.1K followers, [----] engagements
"Implementation took [--] main approaches: [--] Building custom operator implementations [--] Overriding and monkey-patching execution throughout the stack [--] Carefully tweaking some existing backends Each piece was crucial to achieving batch-level bitwise invariance"
X Link 2025-10-22T20:02Z 24.1K followers, [---] engagements
"Testing Strategy: Split tests into prefill vs. prefill+decode (vLLM has different code paths). Ran mix of big/small requests at bs=1 and bs=N checking logprobs are exactly the same. No tolerance for approximation"
X Link 2025-10-22T20:02Z 23.8K followers, [---] engagements
"Debugging Benefits: With proper batch lane tracking we could check for exact bitwise equivalence to isolate non-invariant execution. Best part Batch-invariance makes debugging SO much easier and more rigorous going forward ๐ฏ"
X Link 2025-10-22T20:02Z 24K followers, [----] engagements
"๐๐๐ DeepSeek-OCR is now officially supported in upstream vLLM with the custom logits processor as a core component of the model release The integration has been validated by the original author and here's a dedicated recipe to get started. Check it out https://t.co/xEJ5DDz8zl DeepSeek-OCR is now officially supported in upstream vLLM with the custom logits processor as a core component of the model release The integration has been validated by the original author and here's a dedicated recipe to get started. Check it out https://t.co/xEJ5DDz8zl"
X Link 2025-10-23T01:16Z 26K followers, [----] engagements
"vLLM's sleep mode ( see ) can be leveraged to implement a similar system to Aegaeon to cut down the GPU cost of model market like @huggingface ๐ In fact Aegaeon is also based on vLLM https://x.com/vllm_project/status/1983069225460650103 What DeepSeek did to AI software Alibaba Cloud is doing to AI hardware. By enabling a GPU to serve multiple LLMs at the same time Alibaba is able to cut the number of (Nvidia) GPUs by whopping 82%. Its like Japanese cars v. American gas-guzzlers in the 1970s. Aegaeon is https://t.co/1Y9zlY8ps5 https://x.com/vllm_project/status/1983069225460650103 What"
X Link 2025-10-28T07:45Z 24.5K followers, [----] engagements
"Our first official vLLM Meetup is coming to Europe on Nov [--] Meet vLLM committers @mgoin_ @tms_jr Thomas Parnell + speakers from @RedHat_AI @IBM @MistralAI. Topics: vLLM updates quantization Mistral+vLLM hybrid models distributed inference https://luma.com/0gls27kb https://luma.com/0gls27kb"
X Link 2025-10-28T21:54Z 25.2K followers, [----] engagements
"๐ Congrats to @Zai_org on the release of their latest visual-text compression framework ๐ Don't miss the amazing official end-to-end demo - powered by vLLMs efficient model deployment for lightning-fast performanceโก Want to deploy your own Check out our model usage guide for more tips https://docs.vllm.ai/projects/recipes/en/latest/GLM/Glyph.html Glyph: Scaling Context Windows via Visual-Text Compression Paper: https://t.co/oxDsNJZXRz Weights: https://t.co/IYn2pAQjAn Repo: https://t.co/lFW7ajnHCk Glyph is a framework for scaling the context length through visual-text compression. It renders"
X Link 2025-10-29T01:33Z 24.5K followers, [----] engagements
"๐ฅ Following our big announcement heres the full vLLM takeover at Ray Summit [----] ๐ San Francisco Nov [--] Hosted by @anyscalecompute Get ready for deep dives into high-performance inference unified backends prefix caching MoE serving and large-scale orchestration. Save this schedule ๐ ๐ November [--] vLLM State + Scaling Track ๐ค Simon Mo State of vLLM [----] โก Kaustubh Rao FlashInfer: Accelerating LLM Inference Through Unified High-Performance Kernels. ๐ข Deepak Chandramouli Ankur Goenka Rehan Durrani: vLLM in Apple Scaling LLM Inference with RayServe & vLLM: Building a Serverless Internal"
X Link 2025-10-29T13:06Z 25.5K followers, [----] engagements
"๐Excited to team up with @NVIDIAAIDev to bring Nemotron Nano [--] VL to vLLM - a multimodal model powered by a hybrid TransformerMamba language backbone built for video understanding and document intelligenceโจ Full post https://blog.vllm.ai/2025/10/31/run-multimodal-reasoning-agents-nvidia-nemotron.html https://blog.vllm.ai/2025/10/31/run-multimodal-reasoning-agents-nvidia-nemotron.html"
X Link 2025-10-31T19:01Z 26K followers, [----] engagements
"Love the retrospective on disaggregated inference. If you wonder where the technique named "PD" in vLLM comes from read on Thank you @haoailab for pushing the idea forward. ๐ฅ New Blog: Disaggregated Inference: [--] Months Later [--] months in LLM inference feels like a new Moores Law cycle but this time not just 2x per year: ๐ธ Serving cost 10100x ๐ Throughput 10x โก Latency 5x A big reason Disaggregated Inference. From DistServe our ๐ฅ New Blog: Disaggregated Inference: [--] Months Later [--] months in LLM inference feels like a new Moores Law cycle but this time not just 2x per year: ๐ธ Serving"
X Link 2025-11-04T17:31Z 25.7K followers, [----] engagements
"๐ Learn how to deploy vLLM on NVIDIA DGX Spark the right way NVIDIA just published a detailed best practices guide for running high-throughput inference with vLLM including multi-node setups and optimized Docker builds. @NVIDIAAIDev ๐ Dive in: #vLLM #NVIDIA #DGXSpark #LLMInference #AIInfrastructure https://build.nvidia.com/spark/vllm/overview https://build.nvidia.com/spark/vllm/overview https://build.nvidia.com/spark/vllm/overview https://build.nvidia.com/spark/vllm/overview"
X Link 2025-11-05T12:33Z 25.7K followers, [----] engagements
"Looking forward to integrate @perplexity_ai's fast communication kernels in vLLM Faster than DeepEP for Decode on ConnectX-7. First viable kernel on EFA. SM-Free RDMA transfer. Support prefill. (Maybe portable to other hardware as well) Faster than DeepEP for Decode on ConnectX-7. First viable kernel on EFA. SM-Free RDMA transfer. Support prefill. (Maybe portable to other hardware as well)"
X Link 2025-11-05T17:14Z 25.3K followers, [----] engagements
"Join us Wednesday November [--] in Palo Alto to deep dive into vLLM and @AMD. Hear from @Meta as well on how are they using the AMD GPUs vLLM. vLLM + Meta + AMD Meetup: https://t.co/txkndkuLF6 vLLM + Meta + AMD Meetup: https://t.co/txkndkuLF6"
X Link 2025-11-06T18:10Z 25.5K followers, [----] engagements
"vLLM's plan for omni-modality This is a great achievement from the SGLang team I got immediately asked by a few people what vLLM's plan is. Multimodal generation with omni models is something we have been working on with the community and model vendors. As a sneak peek Hunyuan-image [---] (that beats This is a great achievement from the SGLang team I got immediately asked by a few people what vLLM's plan is. Multimodal generation with omni models is something we have been working on with the community and model vendors. As a sneak peek Hunyuan-image [---] (that beats"
X Link 2025-11-07T22:55Z 25.7K followers, [----] engagements
"Thanks to @github for spotlighting vLLM in the Octoverse [----] report one of the fastest-growing open-source AI projects this year. ๐ Top OSS by contributors ๐ Fastest-growing by contributors ๐ฑ Attracting the most first-time contributors Trusted by leading open model communities and industry partners including NVIDIA Meta Red Hat DeepSeek Qwen Moonshot and others vLLM has become a preferred engine for efficient LLM inference. With almost 63K stars and [----] contributors this growth belongs to the community. Together were building an easier faster and cheaper LLM serving for everyone. #vLLM"
X Link 2025-11-12T14:47Z 26.1K followers, 10.2K engagements
"๐ vLLM Party @ NeurIPS [----] San Diego Were teaming up with @RedHat_AI @AMD and @IBM to host an open-source community party at NeurIPS No agenda. No slides. Just: tacos drinks arcade games unlimited tokens and conversations with builders across the AI infra ecosystem. ๐ Dec [--] ๐ 18:0021:00 GMT-8 ๐ Coin-Op Game Room Register: Come for the tacos stay for the pinball. ๐น๐ฅ #vLLM #NeurIPS2025 #AIInfra #OpenSource https://luma.com/alpuytvr https://luma.com/alpuytvr"
X Link 2025-11-19T12:57Z 26.2K followers, [----] engagements
"๐ Docker Model Runner now integrates vLLM High-throughput LLM inference is now available with the same Docker workflow devs already use. Native safetensors support Automatic routing between llama.cpp (GGUF) and vLLM (safetensors) Works from laptops clusters with one CLI Bringing easy fast and affordable LLM serving to the @Docker ecosystem. Checkout Blog: https://blog.vllm.ai/2025/11/19/docker-model-runner-vllm.html Youve got a model running locally. Now make it repeatable shareable and scalable. Join our live webinar on Docker Model Runner to run manage and scale models with ease plus"
X Link 2025-11-20T15:51Z 26.2K followers, 53.2K engagements
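A hedged sketch of the workflow the post describes, assuming the Docker Model Runner `docker model` CLI is installed; the model reference is a placeholder, and per the post a safetensors model is the case that gets routed to the vLLM backend.

```bash
# Pull and run a model through Docker Model Runner; per the post, safetensors
# models route to vLLM and GGUF models to llama.cpp. The model name is a placeholder.
docker model pull <org>/<safetensors-model>
docker model run <org>/<safetensors-model> "Summarize what vLLM does in one sentence."
```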
"Need to customize vLLM Don't fork it. ๐ vLLM's plugin system lets you inject surgical modifications without maintaining a fork or monkey-patching entire modules. Blog by Dhruvil Bhatt from AWS SageMaker ๐ Why plugins forks: vLLM releases every [--] weeks with 100s of PRs merged Forks require constant rebasing & conflict resolution Monkey patches break on every vLLM upgrade How it works: Use VLLMPatchTargetClass for precise class-level mods Register via vllm.general_plugins entry point Control patches with env vars (VLLM_CUSTOM_PATCHES) Version-guard with min_vllm_version decorator Example: Add"
X Link 2025-11-21T15:10Z 26.2K followers, 21.6K engagements
"๐Congrats on the launch Speculators + vLLM brings a clean standardized path for speculative decoding making it easier to move from draft models to real production workloads. Excited to see how the community uses this to build faster more efficient inference systems. Were open-sourcing a set of high quality speculator models for Llamas Qwens and gpt-oss on Hugging Face. In real workloads you can expect [---] to 2.5x speedups and sometimes more than 4x. Heres how this fits into the bigger story for speculative decoding. A thread ๐งต: https://t.co/YCmbBqzqf6 Were open-sourcing a set of high"
X Link 2025-11-23T04:15Z 26.5K followers, 20.1K engagements
"๐ vLLM Talent Pool is Open As LLM adoption accelerates vLLM has become the mainstream inference engine used across major cloud providers (AWS Google Cloud Azure Alibaba Cloud ByteDance Tencent Baidu) and leading model labs (DeepSeek Moonshot Qwen). To meet the strong demand from top companies the vLLM community is now collecting resumes year-round and helping with referrals (internships & full-time). If you have experience in any of the following areas wed love to hear from you: RL frameworks & algorithms for LLMs Tool calling MCP Harmony format OpenAI/Anthropic API Structured output /"
X Link 2025-11-24T15:32Z 30.3K followers, 73.7K engagements
"Great to see more compact OCR models landing in open source lately and HunyuanOCR is a standout: a strong versatile 1B model with impressive real-world coverage. The vLLM community already provides Day-0 support so you can try it right away: https://docs.vllm.ai/projects/recipes/en/latest/Tencent-Hunyuan/HunyuanOCR.html We are thrilled to open-source HunyuanOCR an expert end-to-end OCR model built on Hunyuan's native multimodal architecture and training strategy. This model achieves SOTA performance with only [--] billion parameters significantly reducing deployment costs. โกBenchmark Leader:"
X Link 2025-11-25T12:10Z 26.5K followers, 13.8K engagements
"FP8 RL on consumer GPUs just got a boost๐ฅ Thrilled to team up with @UnslothAI and TorchAO to bring FP8 GRPO to vLLM: [---] faster RL inference 60% less VRAM [--] longer context and Qwen3-1.7B fitting in 5GB VRAM You can now run FP8 reinforcement learning on consumer GPUs Try DeepSeek-R1s FP8 GRPO at home using only a 5GB GPU. Qwen3-1.7B fits in 5GB VRAM. We collabed with PyTorch to make FP8 RL inference [---] faster. Unsloth: 60% less VRAM [--] longer context. https://t.co/YiBAUb8hz5 https://t.co/X4J6VmRMjY You can now run FP8 reinforcement learning on consumer GPUs Try DeepSeek-R1s FP8 GRPO at home"
X Link 2025-11-25T16:56Z 26.6K followers, 22.1K engagements
"Love this: a community contributor built vLLM Playground to make inferencing visible interactive and experiment-friendly. From visual config toggles to automatic command generation from GPU/M-chip support to GuideLLM benchmarking + LLMCompressor integration it brings the whole vLLM lifecycle into one unified UX. Huge kudos to micyang for this thoughtful polished contribution. ๐ https://github.com/micytao/vllm-playground https://github.com/micytao/vllm-playground"
X Link 2025-11-30T07:06Z 26.6K followers, 24.7K engagements
"More inference workloads now mix autoregressive and diffusion models in a single pipeline to process and generate multiple modalities - text image audio and video. Today were releasing vLLM-Omni: an open-source framework that extends vLLMs easy fast and cost-efficient serving to omni-modality models like Qwen-Omni and Qwen-Image with disaggregated stages for different model modules and components. If you know how to use vLLM you already know how to use vLLM-Omni. Blogpost: Code: Docs: Examples: https://github.com/vllm-project/vllm-omni/tree/main/examples"
X Link 2025-12-01T18:52Z 26.8K followers, 26.4K engagements
"Currently only supports Qwen-Omni and Qwen-Image and this is just the beginning. More models are coming. https://huggingface.co/Qwen/Qwen-Image https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct More inference workloads now mix autoregressive and diffusion models in a single pipeline to process and generate multiple modalities - text image audio and video. Today were releasing vLLM-Omni: an open-source framework that extends vLLMs easy fast and cost-efficient https://huggingface.co/Qwen/Qwen-Image https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct More inference workloads now mix"
X Link 2025-12-02T03:26Z 29.7K followers, 11.2K engagements
"๐ Congratulations to the Mistral team on launching the Mistral [--] family Were proud to share that @MistralAI @NVIDIAAIDev @RedHat_AI and vLLM worked closely together to deliver full Day-0 support for the entire Mistral [--] lineup. This collaboration enabled: NVFP4 (llm-compressor) optimized checkpoints Sparse MoE kernels for Mistral Large [--] Prefill/decode disaggregated serving Multimodal + long-context inference Efficient inference on A100 / H100 / Blackwell ๐ A huge thank-you to @MistralAI @NVIDIAAIDev and @RedHat_AI for the strong partnership and engineering effort that made Day-0"
X Link 2025-12-02T16:17Z 26.8K followers, 31.7K engagements
"LLM agents are powerful but can be slow at scale. @Snowflake's model-free SuffixDecoding from Arctic Inference now runs natively in vLLM beating tuned N-gram speculation across concurrency levels while keeping CPU and memory overhead in check. Quick Start in vLLM: https://docs.vllm.ai/en/stable/features/spec_decode/#speculating-using-suffix-decoding Suffix Decoding is at #NeurIPS2025 as a ๐
spotlight It accelerates LLM inference for coding agents and RL. We also optimized its speculation speed by 7.4x and merged it into vLLM (incoming to SGLang). Talk to @GabrieleOliaro or me at poster #816"
X Link 2025-12-03T08:11Z 26.5K followers, 11.2K engagements
"๐ค Proud to share the first production-ready vLLM plugin for Gaudi developed in close collaboration with the Intel team and fully aligned with upstream vLLM. ๐ง This release is validated and ready for deployment with support for the latest vLLM version coming soon. ๐ The @intel Gaudi team also completely revamped the plugin documentation to make onboarding even smoother. ๐ Release: ๐ Docs: #vLLM #Gaudi #Intel #OpenSource #AIInfra https://vllm-gaudi.readthedocs.io/ https://github.com/vllm-project/vllm-gaudi/releases/tag/v0.11.2 https://vllm-gaudi.readthedocs.io/"
X Link 2025-12-03T13:19Z 26.7K followers, [----] engagements
"๐ vLLM now offers an optimized inference recipe for DeepSeek-V3.2. โ Startup details Run vLLM with DeepSeek-specific components: --tokenizer-mode deepseek_v32 --tool-call-parser deepseek_v32 ๐งฐ Usage tips Enable thinking mode in vLLM: extra_body="chat_template_kwargs":"thinking": True Use reasoning instead of reasoning_content ๐ Special thanks to @TencentCloud for compute and engineering support. ๐ Full recipe (including how to properly use the thinking with tool calls feature): #vLLM #DeepSeek #Inference #ToolCalling #OpenSource"
X Link 2025-12-05T01:56Z 26.9K followers, 31.9K engagements
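A sketch of the usage tip above: enabling thinking mode by passing chat_template_kwargs in the request body to a locally served DeepSeek-V3.2 endpoint. The host, port, and served model name are placeholders.

```bash
# Enable thinking mode via chat_template_kwargs, per the recipe tip quoted above
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "chat_template_kwargs": {"thinking": true}
      }'
```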
"๐ขvLLM v0.12.0 is now available. For inference teams running vLLM at the center of their stack this release refreshes the engine extends long-context and speculative decoding capabilities and moves us to a PyTorch 2.9.0 / CUDA [----] baseline for future work"
X Link 2025-12-05T14:18Z 26.8K followers, [----] engagements
"๐Congrats to the @Zai_org team on the launch of GLM-4.6V and GLM-4.6V-Flash with day-0 serving support in vLLM Recipes for teams who want to run them on their own GPUs. GLM-4.6V focuses on high-quality multimodal reasoning with long context and native tool/function calling while GLM-4.6V-Flash is a 9B variant tuned for lower latency and smaller-footprint deployments; our new vLLM Recipe ships ready-to-run configs multi-GPU guidance and production-minded defaults. If youre building inference services and want GLM-4.6V in your stack start here:"
X Link 2025-12-08T13:18Z 27.3K followers, 45.6K engagements
"Congrats to the @MistralAI team on the launch of Devstral [--] ๐ vLLM now delivers Day-0 support for the Devstral [--] Instruct models optimized for agentic coding deep codebase exploration and multi-file editing at scale. Feel free to reach out ๐ Introducing the Devstral [--] coding model family. Two sizes both open source. Also meet Mistral Vibe a native CLI enabling end-to-end automation. ๐งต Introducing the Devstral [--] coding model family. Two sizes both open source. Also meet Mistral Vibe a native CLI enabling end-to-end automation. ๐งต"
X Link 2025-12-09T16:25Z 26.9K followers, 19.3K engagements
"Low-bit LLM quantization doesnt have to mean painful accuracy trade-offs or massive tuning runs. Intel's AutoRound PTQ algorithm is now integrated into LLM Compressor producing W4A16 compressed-tensor checkpoints you can serve directly with vLLM across Intel Xeon Gaudi Arc GPUs and more. Huge thanks to the @intel Neural Compressor @RedHat_AI optimization LLM Compressor and vLLM community for making this integration happen. Repo: Blog: https://blog.vllm.ai/2025/12/09/intel-autoround-llmc.html https://github.com/vllm-project/llm-compressor"
X Link 2025-12-10T11:04Z 26.9K followers, 12.2K engagements
"vLLM was mentioned in about half of the PyTorch Conference [----] talks (53/117) Several months ago when the @PyTorch conference agenda was out we noticed that there would be [--] dedicated talks about vLLM. After the PyTorch conference we find that actually about half of the talks mentioned vLLM Check out for more details. These mentions can be roughly categorized as: - vLLM itself as a hosted project in PyTorch foundation including vLLM general introduction vLLM V1 on AMD GPUs vLLM-triton-backend - Usage in the open-source ecosystem including Ray PyTorch Core PyTorch Symmetric Memory PyTorch and"
X Link 2025-12-14T07:39Z 27K followers, 18K engagements
"Multimodal serving pain: vision encoder work can stall text prefill/decode and make tail latency jittery. We built Encoder Disaggregation (EPD) in vLLM: run the encoder as a separate scalable service pipeline it with prefill/decode and reuse image embeddings via caching. This provides an efficient and flexible pattern for multimodal serving. Results: consistently higher throughput (520% across stable regions) and significant reductions in P99 TTFT and P99 TPOT. Read more: #vLLM #LLMInference #Multimodal https://blog.vllm.ai/2025/12/15/vllm-epd.html https://blog.vllm.ai/2025/12/15/vllm-epd.html"
X Link 2025-12-15T11:56Z 27.4K followers, 10.9K engagements
"๐Amazing work from @Winterice10 and the team on fast video generationโก We're excited about the upcoming collaboration to integrate TurboDiffusion into vLLM-Omni. Check it out TurboDiffusion: [------] faster video generation on a single RTX [----] ๐ Only takes 1.8s to generate a high-quality 5-second video. The key to both high speed and high quality ๐SageAttention + Sparse-Linear Attention (SLA) + rCM Github: https://t.co/ybbNBjgHFP Technical https://t.co/6d6foxEQ9Z TurboDiffusion: [------] faster video generation on a single RTX [----] ๐ Only takes 1.8s to generate a high-quality 5-second video."
X Link 2025-12-16T00:11Z 27.5K followers, 10.1K engagements
"vLLM delivers even more inference performance with the same GPU platform. In just [--] month we've worked with NVIDIA to increase @nvidia Blackwell maximum throughput per GPU by up to 33% -- significantly reducing cost per token -- while also enabling even higher peak speed for the most latency-sensitive use cases powered by deep PyTorch integration and collaboration"
X Link 2025-12-18T00:29Z 28.4K followers, 99.2K engagements
"Scaling MoE inference is often communication + KV-cache bound: once you push expert parallelism decode can become dominated by collectives and imbalance and prefill stragglers can stall an entire EP group. New community benchmark results for vLLM wide-EP on multi-node H200 (Coreweave Infiniband + ConnectX-7): - Sustained 2.2k tokens/s per H200 GPU (up from earlier 1.5k tokens/s per GPU) In the post we share the key pieces that enable this: - Wide-EP (--enable-expert-parallel) for DeepSeek-style MoE + MLA KV efficiency - DeepEP all-to-all Dual-batch Overlap (DBO) and Expert Parallel Load"
X Link 2025-12-18T16:45Z 28.4K followers, 30.1K engagements
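A hedged sketch of a wide-EP launch built around the `--enable-expert-parallel` flag named in the post; the model and parallelism sizes are placeholders, and the full multi-node recipe (DeepEP, DBO, EPLB) lives in the linked post rather than in this snippet.

```bash
# Wide expert parallelism for a DeepSeek-style MoE model (all values illustrative);
# --enable-expert-parallel is the flag named in the post, the rest are placeholders
vllm serve <deepseek-style-moe-model> \
  --tensor-parallel-size <TP> \
  --data-parallel-size <DP> \
  --enable-expert-parallel
```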
"Diffusion serving is expensive: dozens of timesteps per image and a lot of redundant compute between adjacent steps. โกvLLM-Omni now supports diffusion cache acceleration backends (TeaCache + Cache-DiT) to reuse intermediate Transformer computations no retraining minimal quality impact ๐Benchmarks (NVIDIA H200 Qwen-Image 1024x1024): TeaCache 1.91x Cache-DiT 1.85x. For Qwen-Image-Edit Cache-DiT hits 2.38x Blog: Docs: #vLLM #vLLMOmni #DiffusionModels #AIInference https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion_acceleration"
X Link 2025-12-20T04:25Z 27.8K followers, 20.2K engagements
"๐Thanks to the communitys contributions and our collaboration with Qwen-Image team Qwen-Image-Layered is now supported in vLLM-Omni Check it out https://github.com/vllm-project/vllm-omni/pull/381 ๐จ Qwen-Image-Layered is LIVE native image decomposition fully open-sourced โจ Why it stands out โ
Photoshop-grade layering Physically isolated RGBA layers with true native editability โ
Prompt-controlled structure Explicitly specify [---] layers from coarse layouts to https://t.co/g5mvTt0KTT https://github.com/vllm-project/vllm-omni/pull/381 ๐จ Qwen-Image-Layered is LIVE native image decomposition"
X Link 2025-12-20T18:21Z 27.9K followers, 10.6K engagements
"๐ขvLLM v0.13.0 is now available. Huge thanks to the community. Highlights: Engine core: compile_ranges for selective kernel compilation PrefixLM support for FlexAttention + TritonAttention and CUDA graphs for 3D Triton attention. Plus: xxHash option for prefix caching chunked prefill for ALL pooling tasks and Model Runner V2 updates (min-p sampling logits NaN detection)"
X Link 2025-12-22T06:09Z 27.8K followers, 24.7K engagements
"Congrats to the GLM team on GLM-4.7 a step up in the GLM-4.x series with day-0 serving support in vLLM๐ โก Support MTP decode (faster throughput). ๐งฐ Tool/function calling. ๐ง Thinking controls: interleaved/preserved/per-turn. Command in screenshot below๐ Read more: https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM.html GLM-4.7 is here GLM-4.7 surpasses GLM-4.6 with substantial improvements in coding complex reasoning and tool usage setting new open-source SOTA standards. It also boosts performance in chat creative writing and role-play scenarios. Default Model for Coding Plan:"
X Link 2025-12-23T01:00Z 27.6K followers, 17.8K engagements
"๐ The vLLM community has added support for LongCat-Image-Edit (from @Meituan_LongCat team) in vLLM-Omni. - Simpler path to serve instruction-following image edits - Supports common operations like object add/replace background changes and style adjustments - Useful for retouching tools and creative editing pipelines Recipe: https://docs.vllm.ai/projects/recipes/en/latest/Meituan/Longcat.html https://twitter.com/i/web/status/2003313952516714683 https://docs.vllm.ai/projects/recipes/en/latest/Meituan/Longcat.html https://twitter.com/i/web/status/2003313952516714683"
X Link 2025-12-23T03:57Z 29.8K followers, [----] engagements
"Huge congrats to @MiniMaxAI_ on shipping M2.1 ๐ฅณ A massive step forward for open-source agents. vLLM provides Day-0 support for this release. ๐ We are excited to empower the community to run this full-stack development powerhouse with maximum efficiency. Model Repo: Deploy Recipe: Commands below ๐ #vLLM #MiniMax #OpenSource #AI #LLM https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html#launching-minimax-m21m2-with-vllm https://huggingface.co/MiniMaxAI/MiniMax-M2.1 https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html#launching-minimax-m21m2-with-vllm"
X Link 2025-12-26T09:13Z 28.2K followers, 22.2K engagements
"๐ Big news: The official vLLM website is LIVE Weve built a dedicated hub for our growing community separating logistics from code so the GitHub repo can focus purely on development. Highlights: โจ Interactive vLLM Install Selector (GPU CPU etc) ๐
Community Events Calendar (Office hours Meetups etc) ๐ Centralized Resources (Doc Recipe etc) Check it out ๐ https://vllm.ai https://vllm.ai https://vllm.ai https://vllm.ai"
X Link 2025-12-29T02:09Z 28.4K followers, 54.6K engagements
"Thanks for diving deep into vLLM and sharing your findings. ๐ซถ Were working to make more beginner-friendly documentation available. In the meantime we recommend using the Search (AI) button on our website to find what you need and also following up vLLM office hours https://www.youtube.com/playlistlist=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3 http://vllm.ai http://vllm.ai we dive deep into vllm-project/vllm https://t.co/dOjzontlCj https://www.youtube.com/playlistlist=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3 http://vllm.ai http://vllm.ai we dive deep into vllm-project/vllm https://t.co/dOjzontlCj"
X Link 2025-12-29T14:00Z 28.4K followers, 36.8K engagements
"๐ What a way to end [----] ๐vllm-project/vllm just hit [----] contributors on GitHub. This milestone belongs to all of you. From the first-time contributor fixing a doc typo to the systems engineer rewriting kernelsyou are the reason vLLM evolves so fast. Thank you for every PR every issue and every debate. We built this engine together. ๐๐ With this speed comes complexity. To help you track your code we added a little utility. ๐"
X Link 2025-12-31T03:37Z 28.8K followers, [----] engagements
"Congrats to @Alibaba_Qwen on the release of Qwen-Image-2512 ๐ We are thrilled to announce Day-0 support in vLLM-Omni. You can now serve this SOTA open-source image model with our optimized pipelined architecture immediately. Read more: ๐ See it running below: https://github.com/vllm-project/vllm-omni/pull/547 https://github.com/vllm-project/vllm-omni/pull/547 ๐ANewYeargiftfromQwenQwen-Image-2512ishere. ๐OurDecemberupgradetoQwen-ImagejustintimefortheNewYear. โจWhatsnew: MorerealistichumansdramaticallyreducedAIlookricherfacialdetails Finernaturaltexturessharperlandscapeswater"
X Link 2025-12-31T11:56Z 29.9K followers, 34K engagements
"Spot on @OracleDevs combining vLLM with NVIDIA MIG unlocks peak GPU performance and turns hardware into real AI value With NVIDIA MIG for fine-grained GPU partitioning OKE for scheduling and autoscaling vLLM for inference-optimized serving and DGCM observability for monitoring MIG-level metrics customers can unlock every ounce of performance from their A100 H100 etc. investments and turn https://t.co/kshxcuceNt With NVIDIA MIG for fine-grained GPU partitioning OKE for scheduling and autoscaling vLLM for inference-optimized serving and DGCM observability for monitoring MIG-level metrics"
X Link 2026-01-04T03:24Z 28.3K followers, [----] engagements
"๐ Celebrating vLLM Semantic Router v0.1 Iris ๐ A huge milestone with 600+ PRs from 50+ contributors This release brings system-level intelligence to Mixture-of-Models (MoM). Highlights: โจ SignalDecision Plugin Chain ๐ก HaluGate (Hallucination detection) โก Modular LoRA & Helm charts Read the release: #vLLM #SemanticRouter #OpenSource #AI https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html"
X Link 2026-01-06T02:45Z 28.7K followers, 13.7K engagements
"Huge congratulations to the community on shipping vLLM-Omni v0.12.0rc1 ๐ This release shifts the focus from enabling multimodal to making it production-gradefaster stable and standard-compliant. Highlights from the [---] commits: ๐ Diffusion Overhaul: Integrated TeaCache Cache-DiT Sage Attention Ulysses Sequence Parallelism & Ring Attention for faster generation. ๐ Standard Serving: Native OpenAI-compatible endpoints for Image & Speech. ๐น New Models: Support for Wan2.2 (Video) Qwen-Image-2512 & SD3. ๐ค AMD Ready: Official ROCm Docker & CI support. A massive thank you to the [--] contributors"
X Link 2026-01-06T10:15Z 28.7K followers, [----] engagements
"๐16k TPS with vLLM on B200 Thanks for sharing this success; it's inspiring our community to push boundaries. 16k tokens per second ๐คฏ i have NEVER seen this many tokens in my life nvidia B200 from prime trinity mini from arcee (26b moe) served by vllm (0.13) with [--] tensors parallelism medical SYNTH dataset generation pipeline [---] req/s 16k tps DAMN https://t.co/Ov8TWhmvOZ 16k tokens per second ๐คฏ i have NEVER seen this many tokens in my life nvidia B200 from prime trinity mini from arcee (26b moe) served by vllm (0.13) with [--] tensors parallelism medical SYNTH dataset generation pipeline 350"
X Link 2026-01-08T09:33Z 28.4K followers, 12.6K engagements
"Max out your inference throughput with vLLM's new KV Offloading Connector ๐ This feature from IBM Research allows asynchronous offloading of KV cache to CPU RAM effectively handling request preemptions and boosting concurrency. โก Up to 9x increase in throughput on H100 โก 2x-22x reduction in TTFT for cache hits https://twitter.com/i/web/status/2009217642507477222 https://twitter.com/i/web/status/2009217642507477222"
X Link 2026-01-08T10:56Z 28.7K followers, 13.5K engagements
"To make this efficient the team optimized the Host-Device transfer pipeline. By restructuring GPU memory to use contiguous physical blocks (KB MB) this design unlocks high-speed DMA transfers that run asynchronously without stalling GPU computation"
X Link 2026-01-08T10:56Z 28.4K followers, [----] engagements
"Want to try it --kv_offloading_backend native --kv_offloading_size GB Read the full deep dive on these optimizations for memory variance and DMA: #vLLM #AI #Inference https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html"
X Link 2026-01-08T10:56Z 28.4K followers, [----] engagements
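A sketch assembling the quick-start flags quoted in the post above into a serve command; the offload budget is redacted in the post, so `<SIZE_GB>` is a placeholder, and the model name is illustrative.

```bash
# Offload KV cache blocks to CPU RAM using the flags quoted in the post;
# <SIZE_GB> stands in for the redacted offload budget in gigabytes
vllm serve <model-name> \
  --kv_offloading_backend native \
  --kv_offloading_size <SIZE_GB>
```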
"๐ vLLM-Omni now Day-0 supports GLM-Image by @Zai_org ๐จโจ GLM-Image combines Autoregressive & Diffusion capability for realistic text-rich image generation. ๐ฅ Highlights: โ
Hybrid Architecture โ
Superior Text Rendering โ
Complex Logic Handling โ
Unified Generation/Editing Try it now via PR #763 ๐ Performance optimizations for the AR component are coming soon Stay tuned. โก https://github.com/vllm-project/vllm-omni/pull/763 Introducing GLM-Image: A new milestone in open-source image generation. GLM-Image uses a hybrid auto-regressive plus diffusion architecture combining strong global"
X Link 2026-01-14T12:39Z 29.9K followers, [----] engagements
"Batch inference unlocks vLLM's full throughput potential. ๐ @charles_irl at @modal demonstrates how to fully saturate an H100 (100% GPU util) with Qwen [--] 8B: ๐ FlashInfer backend ๐ Async scheduling ๐ Optimized batch sizes Result: Maximum throughput at minimal cost. Read the guide: https://modal.com/docs/examples/vllm_throughput We also show how to hit 30k input tok/s & 2k output tok/s per H100 with the same model in @vllm_project for a latency-insensitive "batch" workload (summarizing SEC filings for insider trades). At current rates that's 5x cheaper than APIs. https://t.co/EF7DEvLPIM"
X Link 2026-01-14T23:44Z 30.3K followers, 30.6K engagements
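A hedged sketch, not the Modal guide itself, of the levers the post calls out (FlashInfer attention backend, async scheduling, batch-size tuning); flag and environment-variable names can vary between vLLM versions, and the model placeholder mirrors the redacted Qwen 8B reference above.

```bash
# Illustrative only: pick the FlashInfer attention backend and enable async scheduling,
# then tune the batch size; exact flag names may differ across vLLM versions
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve <qwen-8b-model> \
  --async-scheduling \
  --max-num-seqs 256
```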
"7x Longer Context RL with @UnslothAI and @vllm_project You can now do reinforcement learning training with [--] longer context and no accuracy loss via our new batching algorithms. Long reasoning chains in RL are costly but now we enable you to train gpt-oss with GRPO & reach 380K context on a 192GB GPU. https://t.co/FFwxxq2v2N https://t.co/1h5O8rHZTL You can now do reinforcement learning training with [--] longer context and no accuracy loss via our new batching algorithms. Long reasoning chains in RL are costly but now we enable you to train gpt-oss with GRPO & reach 380K context on a 192GB GPU."
X Link 2026-01-15T17:46Z 29.3K followers, 13.6K engagements
"The first @vllm_project meetup of [----] in Munich is here Talks & demos from @RedHat @AMD @MistralAI and CROZ + hands-on GPU workshop. See you there ๐ฅจ๐ป Munich AI builders ๐ Join the @vllm_project meetup on [--] Feb for real world GenAI inference and optimization. Talks and demos from @RedHat @AMD @MistralAI and CROZ plus hands on GPU inference and time to connect with engineers building open AI. ๐ https://t.co/hZOi0Ei0bw https://t.co/EtvEzd1UYm Munich AI builders ๐ Join the @vllm_project meetup on [--] Feb for real world GenAI inference and optimization. Talks and demos from @RedHat @AMD"
X Link 2026-01-16T10:03Z 28.9K followers, [----] engagements
"๐ Day-0 support for FLUX.2 klein is now available in vLLM-Omni FLUX.2 klein from @bfl_ml brings high-performance image generation balancing speed with top-tier aesthetics: โก Sub-second Inference: 0.5s per image for real-time apps. ๐จ All-in-One: Integrated Text-to-Image & Inpainting. ๐ Consumer Friendly: Runs on consumer GPUs (13GB VRAM). ๐ Open Source: Apache [---] licensed 4B model. Try it out with the example script in PR #809: Model: #vLLM #vLLMOmni #FLUX2 #OpenSource https://huggingface.co/black-forest-labs/FLUX.2-klein-4B https://github.com/vllm-project/vllm-omni/pull/809 Introducing"
X Link 2026-01-16T10:29Z 30.2K followers, [----] engagements
"๐ Congrats @StepFun_ai vLLM now has Day-0 support for Step3-VL-10B a 10B multimodal model that punches way above its weight ๐ฅ 10B params SOTA performance ๐ง [--] reasoning modes: SeRe (Sequential) & PaCoRe (Parallel โก 10x-20x more efficient than larger models Download & Details: PR: https://github.com/vllm-project/vllm/pull/32329 https://huggingface.co/stepfun-ai/Step3-VL-10B ๐10B parameters 200B+ performance Introducing STEP3-VL-10B : our open-source SOTA vision language model. ๐ฅ At just 10B it redefines efficiency by matching or exceeding the capabilities of 100B/200B-scale models. SOTA"
X Link 2026-01-21T00:05Z 29.9K followers, [----] engagements
"๐ gRPC server entrypoint (#30190) Binary protocol + HTTP/2 multiplexing for high-throughput serving. ๐ง --max-model-len auto (#29431) Automatically fits context length to available GPU memory - no more OOM at startup ๐ Model inspection view (#29450) See modules attention backends and quantization by setting VLLM_LOG_MODEL_INSPECTION=1 or printing the LLM object"
X Link 2026-01-21T03:16Z 29.7K followers, [----] engagements
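A short sketch combining two of the serving conveniences listed above; the model name is a placeholder.

```bash
# Fit the context length to available GPU memory and print the model-inspection
# view at startup, per the release notes quoted above
VLLM_LOG_MODEL_INSPECTION=1 \
vllm serve <model-name> \
  --max-model-len auto
```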
"New Model Support: ๐ฆ Grok-2 with tiktoken tokenizer ๐ LFM2-VL vision-language model โก MiMo-V2-Flash ๐ GLM-ASR audio ๐งฉ K-EXAONE-236B-A23B MoE LoRA now supports multimodal tower/connector for LLaVA BLIP2 PaliGemma Pixtral and more ๐ฅ"
X Link 2026-01-21T03:16Z 30K followers, [----] engagements
"CUTLASS MoE optimizations: 2.9% throughput + 10.8% TTFT improvement via fill(0) optimization ๐ Hardware updates: ๐ป SM103 support โซ B300 Blackwell MoE configs ๐ข Marlin for Turing (sm75) Large-scale serving: XBO (Extended Dual-Batch Overlap) NIXL asymmetric TP ๐ ๐ https://github.com/vllm-project/vllm/releases/tag/v0.14.0 https://github.com/vllm-project/vllm/releases/tag/v0.14.0"
X Link 2026-01-21T03:16Z 29.9K followers, [----] engagements
"Love seeing this. ๐ฅ This is vLLM doing what it does best: serving as the bridge between powerful open source models like GLM-4.7-Flash and real-world inference. The workflow shown in vLLM Studio by @0xSero as a playground makes that connection tangible. Onward to easy fast and cheap inference for everyone ๐ Man what a model. I have not seen any mode below 200B act like this. It's really doing a good job in a pretty novel environment. The ZAI team is doing a great job I'm very happy they still open source all this. It's very impressive. https://t.co/BB6A0ezLGY https://t.co/kDj5KzWuFY Man"
X Link 2026-01-23T03:12Z 30.3K followers, 14.9K engagements
"vLLM 0.4.0 release has a suite of new additions To start we added support for @cohere's Command+R @DbrxMosaicAI's DBRX @Alibaba_Qwen's latest MoE. Notably all of these models are supported on DAY ONE directly from the model builders"
X Link 2024-04-05T16:18Z 31.2K followers, [----] engagements
"We just released v0.5.0 of vLLM one week away from our one year anniversary The new release marks many features entering beta phase ready for testing and usage. These features boost your LLM serving system for better performance and utilization. https://github.com/vllm-project/vllm/releases/tag/v0.5.0 https://github.com/vllm-project/vllm/releases/tag/v0.5.0"
X Link 2024-06-14T00:51Z 31.2K followers, 11.7K engagements
"vLLM running on the latest @nvidia H200 GPUs is delivering high token throughput per user on Llama [---] 405B out of the gate Check it out ๐"
X Link 2024-07-23T17:45Z 31.2K followers, 12.8K engagements
"๐ Thank you @nvidia for sponsoring vLLM development. The DGX H200 machine is marvelous We plan to use the machine for benchmarking and performance enhancement ๐"
X Link 2024-08-08T20:28Z 31.2K followers, 42.1K engagements
"A month ago we announced our performance roadmap. Today we are happy to share that the latest release achieves ๐2.7x higher throughput and is 5x faster for output latency on Llama 8B and 1.8x higher throughput and 2x faster on Llama 70B for H100s. https://blog.vllm.ai/2024/09/05/perf-update.html https://blog.vllm.ai/2024/09/05/perf-update.html"
X Link 2024-09-05T17:12Z 31.2K followers, 95.9K engagements
"We are excited to see @vllm_project as an option for local apps in the @huggingface hub It comes with easy snippets to quickly test out the model"
X Link 2024-09-09T21:35Z 31.3K followers, 25.5K engagements
"๐ผ pip install -U vLLM vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=4' --max_num_batched_tokens [-----] https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py magnet:xt=urn:btih:7278e625de2b1da598b23954c13933047126238a&dn=pixtral-12b-240910&tr=udp%3A%2F%https://t.co/OdtBUsbMKD%3A1337%2Fannounce&tr=udp%3A%2F%https://t.co/2UepcMHjvL%3A1337%2Fannounce&tr=http%3A%2F%https://t.co/NsTRgy7h8S%3A80%2Fannounce https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py"
X Link 2024-09-12T02:00Z 31.2K followers, 36K engagements
"pip install -U vLLM ๐ฆ vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs [--] We are excited to announce day [--] support for @AIatMeta's Llama [---] vision language models. Try it out using our latest released. Please see for more commands and known issues. https://github.com/vllm-project/vllm/issues/8826 ๐ฃ Introducing Llama 3.2: Lightweight models for edge devices vision models and more Whats new Llama [---] 1B & 3B models deliver state-of-the-art capabilities for their class for several on-device use cases with support for @Arm @MediaTek & @Qualcomm on day one."
X Link 2024-09-25T22:51Z 31.2K followers, 20.4K engagements
"Speculative decoding is one of the best tool in the vLLM's suite of inference optimization tool box accelerating the inference without accuracy loss. Checkout our blog post for more details about the state of spec decode in vLLM today ๐งต https://blog.vllm.ai/2024/10/17/spec-decode.html https://blog.vllm.ai/2024/10/17/spec-decode.html"
X Link 2024-10-22T22:28Z 31.2K followers, 31K engagements
"Open-source innovation is part of the vLLMs DNA and we love the PyTorch ecosystem Together let's push the boundaries of AI innovation and make it accessible to all๐ช vLLM Joins PyTorch Ecosystem ๐ @vllm_project has always had a strong connection with the PyTorch project. Tight coupling with PyTorch ensures seamless compatibility and performance optimization across diverse hardware platforms. Read more: https://t.co/GgTrTt2srd https://t.co/YYEQFVLlD6 vLLM Joins PyTorch Ecosystem ๐ @vllm_project has always had a strong connection with the PyTorch project. Tight coupling with PyTorch ensures"
X Link 2024-12-09T21:06Z 31.2K followers, 24.1K engagements
"pip install -U vLLM You can now run DeepSeek-V3 on latest vLLM many different ways: ๐ฐ Tensor parallelism on 8xH200 or MI300x or TP16 on IB connected nodes: --tensor-parallel-size ๐ Pipeline parallelism () across two 8xH100 or any collection of machines without high speed interconnect: --pipeline-parallel-size ๐ CPU offloading the model layers via --cpu-offload-gb See our distributed guide and enhancement plan We are at the start of optimizing this amazing model https://github.com/vllm-project/vllm/issues/11539 https://docs.vllm.ai/en/latest/serving/distributed_serving.html ๐ Introducing"
X Link 2024-12-27T01:24Z 31.2K followers, 35.6K engagements
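The flag values are redacted in the post, so the sketch below uses placeholders for the three deployment patterns it lists; treat it as illustrative rather than a tuned configuration.

```bash
# Three ways to fit DeepSeek-V3, mirroring the post; all sizes are placeholders
# 1) Tensor parallelism on a single well-connected node (e.g. 8xH200 or MI300x)
vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size <TP>

# 2) Pipeline parallelism across machines without a high-speed interconnect
vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size <TP> --pipeline-parallel-size <PP>

# 3) Offload part of the model layers to CPU memory
vllm serve deepseek-ai/DeepSeek-V3 --cpu-offload-gb <GB>
```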
"Thanks for the community support vLLM now supports MacOS natively๐คฉ Experimental support for Apple SoC (CPU only for now) has landed in @vllm_project. You'll need to compile from git HEAD. Link in ๐งต Experimental support for Apple SoC (CPU only for now) has landed in @vllm_project. You'll need to compile from git HEAD. Link in ๐งต"
X Link 2025-01-10T02:27Z 31.3K followers, 17.5K engagements
"๐ With the v0.7.0 release today we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup Clean code optimized execution loop zero-overhead prefix caching enhanced multimodal support and more"
X Link 2025-01-27T19:52Z 31.2K followers, 96.1K engagements
"We landed the 1st batch of enhancements to the @deepseek_ai models starting MLA and cutlass fp8 kernels. Compared to v0.7.0 we offer 3x the generation throughput 10x the memory capacity for tokens and horizontal context scalability with pipeline parallelism"
X Link 2025-02-01T23:46Z 31.2K followers, 91.5K engagements
"v0.7.2 is released Featuring ๐ผ @Alibaba_Qwen Qwen2.5-VL ๐ค @huggingface Transformers backend and several @deepseek_ai performance enhancements"
X Link 2025-02-06T19:09Z 31.2K followers, 102.9K engagements
"vLLM v0.7.3 now supports @deepseek_ai's Multi-Token Prediction module It delivers up to 69% speedup boost. You can turn it on with --num-speculative-tokens=1 and an optional --draft-tensor-parallel-size=1. We saw 81-82.3% acceptance rate on the ShareGPT"
X Link 2025-02-20T18:45Z 31.3K followers, 40K engagements
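A sketch of how the two flags quoted in the post fit into a serve command; the model ID and tensor-parallel size are illustrative, and the flag spellings are taken verbatim from the post:

```bash
# Enable DeepSeek's Multi-Token Prediction module as the speculative draft (vLLM v0.7.3+).
# --num-speculative-tokens sets how many draft tokens are proposed per step;
# --draft-tensor-parallel-size is optional, per the post.
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --num-speculative-tokens 1 \
  --draft-tensor-parallel-size 1
```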
"We're excited to receive our first #NVIDIADGX B200 system which we'll use for vLLM research and development Thank you @nvidia"
X Link 2025-02-21T18:15Z 31.2K followers, 119.5K engagements
"๐ We just merged initial EP support in vLLM and will be integrating these collectives ASAP ๐ https://github.com/vllm-project/vllm/pull/12583 ๐ Day [--] of #OpenSourceWeek: DeepEP Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and inference. โ
Efficient and optimized all-to-all communication โ
Both intranode and internode support with NVLink and RDMA โ
https://github.com/vllm-project/vllm/pull/12583 ๐ Day [--] of #OpenSourceWeek: DeepEP Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and"
X Link 2025-02-25T02:37Z 31.2K followers, 40.2K engagements
"๐ @vllm_project will be testing and integrating these GEMM kernels ASAP as well. ๐ Day [--] of #OpenSourceWeek: DeepGEMM Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs powering V3/R1 training and inference. โก Up to 1350+ FP8 TFLOPS on Hopper GPUs โ
No heavy dependency as clean as a tutorial โ
Fully Just-In-Time compiled ๐ Day [--] of #OpenSourceWeek: DeepGEMM Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs powering V3/R1 training and inference. โก Up to 1350+ FP8 TFLOPS on Hopper GPUs โ
No heavy dependency as clean as a tutorial"
X Link 2025-02-26T01:55Z 31.3K followers, 23K engagements
"Just landed FlashMLA in vLLM and it is already boosting output throughput 2-16% We expect more improvements in the coming days as we continue to optimize the code path. https://github.com/vllm-project/vllm/pull/13747 ๐ Day [--] of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs optimized for variable-length sequences and now in production. โ
BF16 support โ
Paged KV cache (block size 64) โก [----] GB/s memory-bound & [---] TFLOPS https://github.com/vllm-project/vllm/pull/13747 ๐ Day [--] of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our"
X Link 2025-02-27T06:15Z 31.4K followers, 49.5K engagements
"Amazing system It is now the north star for LLM inference ๐. We will get there quickly. ๐ Day [--] of #OpenSourceWeek: One More Thing DeepSeek-V3/R1 Inference System Overview Optimized throughput and latency via: ๐ง Cross-node EP-powered batch scaling ๐ Computation-communication overlap โ Load balancing Statistics of DeepSeek's Online Service: โก 73.7k/14.8k ๐ Day [--] of #OpenSourceWeek: One More Thing DeepSeek-V3/R1 Inference System Overview Optimized throughput and latency via: ๐ง Cross-node EP-powered batch scaling ๐ Computation-communication overlap โ Load balancing Statistics of"
X Link 2025-03-01T04:52Z 31.3K followers, 46.8K engagements
"Spotted @vllm_project during Jensen's Keynote @nvidia #GTC"
X Link 2025-03-18T18:31Z 31.3K followers, 38.6K engagements
"vLLM v0.8.3 now supports @AIatMeta's latest Llama [--] Scout and Maverick. We see these open source models as a major step forward in efficiency with long context feature native multi-modality and MoE architecture. Best tips of running it ๐งต https://blog.vllm.ai/2025/04/05/llama4.html https://blog.vllm.ai/2025/04/05/llama4.html"
X Link 2025-04-06T07:03Z 31.3K followers, 24.9K engagements
"spotted @vllm_project at @googlecloud next keynote today"
X Link 2025-04-10T04:42Z 31.2K followers, 51.6K engagements
"๐ @deepseek_ai's highly performant inference engine is built on top of vLLM. Now they are open-sourcing the engine the right way: instead of a separate repo they are bringing changes to the open source community so everyone can immediately benefit https://github.com/deepseek-ai/open-infra-index/blob/main/OpenSourcing_DeepSeek_Inference_Engine/README.md https://github.com/deepseek-ai/open-infra-index/blob/main/OpenSourcing_DeepSeek_Inference_Engine/README.md"
X Link 2025-04-14T06:34Z 31.2K followers, 206.6K engagements
"vLLM๐ค๐ค You can now deploy any @huggingface language model with vLLM's speed. This integration makes it possible for one consistent implementation of the model in HF for both training and inference. ๐งต https://blog.vllm.ai/2025/04/11/transformers-backend.html https://blog.vllm.ai/2025/04/11/transformers-backend.html"
X Link 2025-04-17T19:57Z 31.3K followers, 73.8K engagements
"perf update: we are continuing to see benefits with vLLM V1 engines highly performant design. on 8xH200 vLLM leads in throughput for @deepseek_ai V3/R1 models. we expect further enhancements in collaboration with DeepSeeks inference engine open source plan"
X Link 2025-04-17T22:59Z 31.2K followers, 42.1K engagements
"After feedback about our v0.8.4 benchmark for @deepseek_ai R1 we rerun it with suggested changes: vLLM no EP SGLang updated v0.4.5 - post1 and EP - DP TensorRT-LLM uses overlap scheduler and tuned parameters. We are seeing good results So why was there a difference ๐งต"
X Link 2025-04-19T08:41Z 31.4K followers, 80.2K engagements
"OpenRLHF is a pioneering framework to use vLLM for RLHF driving many design and implementation of vLLM's features for RLHF making vLLM a popular choice for many RLHF frameworks. Learn more about the story at https://blog.vllm.ai/2025/04/23/openrlhf-vllm.html https://blog.vllm.ai/2025/04/23/openrlhf-vllm.html"
X Link 2025-04-24T07:29Z 31.2K followers, 48.4K engagements
"looking amazing We found the quality of the documentation and diagram to be great and accurate. Will recommend :D https://deepwiki.com/vllm-project/vllm Project DeepWiki Up-to-date documentation you can talk to for every repo in the world. Think Deep Research for GitHub powered by Devin. Its free for open-source no sign-up Visit deepwiki com or just swap github deepwiki on any repo URL: https://t.co/5bHbvq98Ud https://deepwiki.com/vllm-project/vllm Project DeepWiki Up-to-date documentation you can talk to for every repo in the world. Think Deep Research for GitHub powered by Devin. Its free"
X Link 2025-04-25T22:47Z 31.3K followers, 14K engagements
"pip install -U vLLM vllm serve Qwen/Qwen3-235B-A22B-FP8 --enable-reasoning --reasoning-parser deepseek_r1 --tensor-parallel-size [--] vLLM introduce Day [--] support for @Alibaba_Qwen Qwen3 and Qwen3 MoE model architecture. Try it out: https://github.com/vllm-project/vllm/issues/17327 Introducing Qwen3 We release and open-weight Qwen3 our latest large language models including [--] MoE models and [--] dense models ranging from 0.6B to 235B. Our flagship model Qwen3-235B-A22B achieves competitive results in benchmark evaluations of coding math general https://t.co/JWZkJeHWhC"
X Link 2025-04-29T00:11Z 31.3K followers, 60.3K engagements
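The full command from the post, with the redacted tensor-parallel size replaced by a placeholder (set it to the number of GPUs on your node):

```bash
# Day-0 Qwen3 serving command from the post; the TP size below is a placeholder.
pip install -U vllm
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --tensor-parallel-size 4
```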
"Very cool spec decode technique built on top of @vllm_project We love the idea of suffix decoding ๐ Excited to share our work on Speculative Decoding @Snowflake AI Research ๐ 4x faster LLM inference for coding agents like OpenHands @allhands_ai ๐ฌ 2.4x faster LLM inference for interactive chat ๐ป Open-source via Arctic Inference as a plugin for @vllm_project ๐งต https://t.co/cdTuI7b9Yr Excited to share our work on Speculative Decoding @Snowflake AI Research ๐ 4x faster LLM inference for coding agents like OpenHands @allhands_ai ๐ฌ 2.4x faster LLM inference for interactive chat ๐ป"
X Link 2025-05-02T18:12Z 31.3K followers, [----] engagements
"Great work We love how @vllm_project is used in the rollout process with with offloading the engine to CPU and give the GPU back to the kernel to be benchmarked This is a small feature we implemented to make RLHF smoother with vLLM. Our research interns present: Kevin-32B = K(ernel D)evin It's the first open model trained using RL for writing CUDA kernels. We implemented multi-turn RL using GRPO (based on QwQ-32B) on the KernelBench dataset. It outperforms top reasoning models (o3 & o4-mini) ๐งต https://t.co/I3UXLGKFNb Our research interns present: Kevin-32B = K(ernel D)evin It's the first"
X Link 2025-05-07T01:45Z 31.3K followers, 12.9K engagements
""inference uses @vllm_project" Releasing INTELLECT-2: Were open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: Detailed Technical Report INTELLECT-2 model checkpoint https://t.co/iHDDHRyKN2 Releasing INTELLECT-2: Were open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: Detailed Technical Report INTELLECT-2 model checkpoint https://t.co/iHDDHRyKN2"
X Link 2025-05-12T05:39Z 31.2K followers, 10.7K engagements
"uv pip install -U vLLM The latest release features [---] commits from [---] contributors. vLLM is now ready for @nvidia Blackwell with the latest @PyTorch [---] upgrade. Huge thanks to @NVIDIAAIDev and @ye_combinator for the CUTLASS and FlashInfer kernels"
X Link 2025-05-28T01:54Z 31.2K followers, 17.7K engagements
"Congrats on the launch vLLM is proud to support the new Qwen3 embedding models check it out ๐๐ป https://github.com/QwenLM/Qwen3-Embeddingtab=readme-ov-file#vllm-usage ๐ Proud to introduce the Qwen3-Embedding and Qwen3-Reranker Series setting new standards in multilingual text embedding and relevance ranking โจ Highlights: โ
Available in 0.6B / 4B / 8B versions โ
Supports [---] languages โ
State-of-the-Art performance on MMTEB MTEB https://t.co/qNu0rswSol https://github.com/QwenLM/Qwen3-Embeddingtab=readme-ov-file#vllm-usage ๐ Proud to introduce the Qwen3-Embedding and Qwen3-Reranker Series"
X Link 2025-06-05T15:45Z 31.3K followers, 12.2K engagements
"uv pip install -U vllm --extra-index-url --torch-backend=auto Try out Magistral on with vLLM 0.9.1rc1 today ๐ฎ https://wheels.vllm.ai/0.9.1rc1 Announcing Magistral our first reasoning model designed to excel in domain-specific transparent and multilingual reasoning. https://t.co/SwKEEtCIXh https://wheels.vllm.ai/0.9.1rc1 Announcing Magistral our first reasoning model designed to excel in domain-specific transparent and multilingual reasoning. https://t.co/SwKEEtCIXh"
X Link 2025-06-10T14:37Z 31.2K followers, 11.7K engagements
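A sketch of the install command above with the release-candidate wheel index from the post filled in as the --extra-index-url argument (that pairing is an assumption), plus an illustrative serve line; the exact Magistral model ID should be taken from Mistral's announcement:

```bash
# Install the 0.9.1rc1 wheels referenced in the post; --torch-backend=auto picks a
# matching torch build (assumption: wheels.vllm.ai/0.9.1rc1 is the intended index).
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/0.9.1rc1 --torch-backend=auto

# Illustrative serve command; Mistral-family models use the mistral tokenizer mode.
vllm serve mistralai/Magistral-Small-2506 --tokenizer_mode mistral
```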
"๐ Look what just arrived at @BerkeleySky ๐ A shiny MI355X system. Huge thanks to @AMD for supporting open source and we are looking forward to getting it set up in the next few days"
X Link 2025-06-10T17:20Z 31.3K followers, 20.7K engagements
"Thank you @AMD @LisaSu @AnushElangovan for Advancing AI together with @vllm_project We look forward to the continued partnership and pushing the boundary of inference"
X Link 2025-06-13T23:54Z 31.2K followers, 17.3K engagements
"Congrats on the launch vLLM is proud to support this great model on day [--] looking forward to the following releases Day 1/5 of #MiniMaxWeek: Were open-sourcing MiniMax-M1 our latest LLM setting new standards in long-context reasoning. - Worlds longest context window: 1M-token input 80k-token output - State-of-the-art agentic use among open-source models - RL at unmatched efficiency: https://t.co/bGfDlZA54n Day 1/5 of #MiniMaxWeek: Were open-sourcing MiniMax-M1 our latest LLM setting new standards in long-context reasoning. - Worlds longest context window: 1M-token input 80k-token output -"
X Link 2025-06-16T16:14Z 31.3K followers, 19K engagements
"vLLM has just reached 50K github stars Huge thanks to the community๐ Together let's bring easy fast and cheap LLM serving for everyoneโ๐ป"
X Link 2025-06-19T05:25Z 31.2K followers, 18.3K engagements
""The second most intelligent open weights model after DeepSeek R1 with a much longer 1M token context window" Checkout the blog post from @minimax_ai on how the model is implemented on vLLM and how you can run this model efficiently https://blog.vllm.ai/2025/06/30/minimax-m1.html MiniMax launches their first reasoning model: MiniMax M1 the second most intelligent open weights model after DeepSeek R1 with a much longer 1M token context window @minimax_ai M1 is based on their Text-01 model (released [--] Jan 2025) - an MoE with 456B total and 45.9B active https://t.co/JltMYrm0te"
X Link 2025-07-01T04:16Z 31.2K followers, 13.9K engagements
"We genuinely want to solve this problem. As many (@Rxday000 @samsja19 @danielhanchen @_EldarKurtic and more) chimed in the reason includes attention kernels matmul reduction order precisions in various operators and more horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh"
X Link 2025-07-06T21:30Z 31.2K followers, 37.3K engagements
"vLLM runs on free-threaded Python A group of engineers from @Metas Python runtime language team has shown that its possible to run vLLM on the nogil distribution of Python. Were incredibly excited to embrace this future technique and be early adopters ๐"
X Link 2025-07-08T05:06Z 31.2K followers, 50.6K engagements
"@Kimi_Moonshot just released a trillion-parameter model with great agentic capability and it is already supported in vLLM Have a try with a simple command and check the doc for more advanced deployment๐ ๐ Hello Kimi K2 Open-Source Agentic Model ๐น 1T total / 32B active MoE model ๐น SOTA on SWE Bench Verified Tau2 & AceBench among open models ๐นStrong in coding and agentic tasks ๐ค Multimodal & thought-mode not supported for now With Kimi K2 advanced agentic intelligence https://t.co/PlRQNrg9JL ๐ Hello Kimi K2 Open-Source Agentic Model ๐น 1T total / 32B active MoE model ๐น SOTA on SWE Bench"
X Link 2025-07-11T15:06Z 31.4K followers, 34.4K engagements
"Thanks for the great write-up ๐ Prefix caching is critical for agentic workflows like @manusai and vLLM makes it seamless. โ
prefix caching is enabled by default with an efficient implementation โ
Append-only context Cache hit heaven Context engineering FTW ๐ After four overhauls and millions of real-world sessions here are the lessons we learned about context engineering for AI agents: https://t.co/Ql014rEzBQ After four overhauls and millions of real-world sessions here are the lessons we learned about context engineering for AI agents: https://t.co/Ql014rEzBQ"
X Link 2025-07-19T14:20Z 31.2K followers, 15K engagements
"The @huggingface Transformers @vllm_project integration just leveled up: Vision-Language Models are now supported out of the box If the model is integrated into Transformers you can now run it directly with vLLM. Great work @RTurganbay ๐ https://github.com/vllm-project/vllm/pull/20543 https://github.com/vllm-project/vllm/pull/20543"
X Link 2025-07-22T20:32Z 31.2K followers, 22.3K engagements
"โ
Try out @Alibaba_Qwen [--] Coder on vLLM nightly with "qwen3_coder" tool call parser Additionally vLLM offers expert parallelism so you can run this model in flexible configurations where it fits. Qwen3-Coder is here โ
Were releasing Qwen3-Coder-480B-A35B-Instruct our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves https://t.co/Z8HfyrVScE Qwen3-Coder is here โ
Were releasing Qwen3-Coder-480B-A35B-Instruct our most powerful open agentic code model to"
X Link 2025-07-22T22:06Z 31.2K followers, 34.2K engagements
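A hedged sketch of serving Qwen3-Coder with the tool-call parser named in the post; the tensor-parallel size and expert-parallel flag are illustrative additions, not quoted from the post:

```bash
# Serve Qwen3-Coder on a nightly vLLM build with tool calling enabled.
# "qwen3_coder" is the parser named in the post; EP/TP settings are illustrative.
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```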
"This amazing Attention-FFN disaggregation implementation from @StepFun_ai achieves decoding throughput of up to [----] tokens per second per GPU under 50ms TPOT SLA for their 321B-A38B MoE model Step3 served with H800 The implementation is based on vLLM and we are working together to bring it to the public Kudos to @StepFun_ai ๐ Check out their tech report at . https://github.com/stepfun-ai/Step3 https://github.com/stepfun-ai/Step3"
X Link 2025-07-26T01:26Z 31.2K followers, 33.6K engagements
"๐ฅ vLLM @ PyTorch Conference [----] ๐ฅ Were excited to share that [--] talks at this years PyTorch Conference will feature vLLM Topics include: Easy & Fast LLM Serving Open-Source Post-Training Stack Scaling Online LLM Training AMD GPU support via Triton vllm-triton backend performance Stay tuned & come say hi ๐ #vLLM #PyTorch #LLM #AI #opensource ๐ ICYMI: The #PyTorchConf schedule is now live https://t.co/YSAdVaiWRk Were talking [--] days of cutting-edge talks on #LLM scaling real-time inference model optimization & more straight from the #PyTorch community. ๐ Oct [----] San Francisco ๐ Register:"
X Link 2025-07-31T07:31Z 31.2K followers, 29.1K engagements
"Thank you @OpenAI for open-sourcing these great models ๐ Were proud to be the official launch partner for gpt-oss (20B & 120B) now supported in vLLM ๐ โก MXFP4 quant = fast & efficient ๐ Hybrid attention (sliding + full) ๐ค Strong agentic abilities ๐ Easy deployment ๐๐ป Check out the blog and recipes for more details ๐ฅ #vLLM #gptOSS #openai https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html https://blog.vllm.ai/2025/08/05/gpt-oss.html Our open models are here. Both of them. https://t.co/9tFxefOXcg https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html"
X Link 2025-08-05T17:31Z 31.2K followers, 44.3K engagements
"๐ we care a lot about correctness ran many evals and stared at many tensors to compare them. numerics of vLLM on hopper should be solid and verified if you run into any correctness issue on vLLM we would love to know and debug them Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across providers and runtimes right now due to implementation differences. Were working with inference providers to make sure gpt-oss performs at its best everywhere and wed love your feedback Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across"
X Link 2025-08-06T03:51Z 31.2K followers, 66.7K engagements
"Have you ever felt you are developing cuda kernels and your tests often run into illegal memory access (IMA for short) and you have no idea how to debug We have collaborated with the @nvidia team to investigate how cuda core dump can help check out the blogpost to learn more https://blog.vllm.ai/2025/08/11/cuda-debugging.html https://blog.vllm.ai/2025/08/11/cuda-debugging.html"
X Link 2025-08-13T03:55Z 31.4K followers, 34K engagements
"๐ Amazing community project vLLM CLI a command-line tool for serving LLMs with vLLM: โ
Interactive menu-driven UI & scripting-friendly CLI โ
Local + HuggingFace Hub model management โ
Config profiles for perf/memory tuning โ
Real-time server & GPU monitoring โ
Error logs & recovery ๐ฆ Install in one line: pip install vllm-cli GitHub: ๐ Would you like to see these features in vLLM itself Try it out & share feedback https://github.com/Chen-zexi/vllm-cli https://github.com/Chen-zexi/vllm-cli"
X Link 2025-08-17T08:52Z 31.2K followers, 72.8K engagements
"๐ GLM-4.5 meets vLLM @Zai_org 's latest GLM-4.5 & GLM-4.5V models bring hybrid reasoning coding & intelligent agent capabilitiesnow fully supported in vLLM for fast efficient inference on NVIDIA Blackwell & Hopper GPUs Read more ๐ https://blog.vllm.ai/2025/08/19/glm45-vllm.html https://blog.vllm.ai/2025/08/19/glm45-vllm.html"
X Link 2025-08-19T09:10Z 31.2K followers, 16.9K engagements
"๐ Exciting news: DeepSeek-V3.1 from @deepseek_ai now runs on vLLM ๐ง Seamlessly toggle Think / Non-Think mode per request โก Powered by vLLMs efficient serving scale to multi-GPU with ease ๐ Perfect for agents tools and fast reasoning workloads ๐ Guide & examples: https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_1.html Introducing DeepSeek-V3.1: our first step toward the agent era ๐ ๐ง Hybrid inference: Think & Non-Think one model two modes โก Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs. DeepSeek-R1-0528 ๐ Stronger agent skills: Post-training"
X Link 2025-08-21T17:20Z 31.2K followers, 23.6K engagements
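A minimal sketch of the per-request Think / Non-Think toggle, assuming the model's chat template exposes a thinking switch through vLLM's chat_template_kwargs request field (check the linked recipe for the exact field name and recommended parallelism):

```bash
# Serve the model (tensor-parallel size is illustrative).
vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8

# Toggle thinking per request via the OpenAI-compatible API.
# Assumption: the chat template reads a "thinking" flag from chat_template_kwargs.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.1",
        "messages": [{"role": "user", "content": "Outline a migration plan."}],
        "chat_template_kwargs": {"thinking": true}
      }'
```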
"Wow glad to see vLLM powers @jiawzhao 's DeepConf work impressive results on AIME [----] Do you think this sampling control makes sense Have a try and leave a comment in that PR to let us know https://github.com/vllm-project/vllm/pull/23201 Introducing DeepConf: Deep Think with Confidence ๐ First method to achieve 99.9% on AIME [----] with open-source models Using GPT-OSS-120B even without tools we reached this almost-perfect accuracy while saving up to 85% generated tokens. It also delivers many strong https://t.co/MlhDUKmawH https://github.com/vllm-project/vllm/pull/23201 Introducing DeepConf:"
X Link 2025-08-23T15:31Z 31.3K followers, 48.3K engagements
"๐ vLLM Shanghai Meetup Recap ๐ Last weekend we gathered with the community in Shanghai to dive into: Contributing to vLLM Distributed inference ERNIE [---] integration Mooncake + LMCache MetaX hardware support The community is pushing vLLM to new levels of performance scalability & adaptability. ๐ Event notes: ๐ Slides: https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg"
X Link 2025-08-25T08:59Z 31.2K followers, 12.1K engagements
"๐ LLM Compressor v0.7.0 is here This release brings powerful new features for quantizing large language models including transform support (QuIP SpinQuant) mixed precision compression improved MoE handling with Llama4 support and more. Full blog: More info below ๐ https://developers.redhat.com/articles/2025/08/25/llm-compressor-070-release-recap https://developers.redhat.com/articles/2025/08/25/llm-compressor-070-release-recap"
X Link 2025-08-26T20:02Z 31.2K followers, 16.8K engagements
"๐ vLLM now supports Kwai Keye-VL-1.5 With sharper video ๐น & image ๐ผ comprehension stronger reasoning and an extended 128K context length this model unlocks richer conversations and more complex tasks than ever before. Upgrade to the nightly build and experience it today Check out for more details. #AI #vLLM #KeyeVL https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B"
X Link 2025-09-01T13:36Z 31.2K followers, 10.7K engagements
"Amazing blogpost from @gordic_aleksa explaining internals of vLLM๐ New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of understanding of the codebase and then to write up https://t.co/F2wsYaFO7q New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of"
X Link 2025-09-01T16:06Z 31.4K followers, 32.8K engagements
"vLLM is proud to support the great Kimi update from @Kimi_Moonshot better tool-calling longer context and more Check the deployment guide at ๐ฅ https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/docs/deploy_guidance.md Kimi K2-0905 update ๐ - Enhanced coding capabilities esp. front-end & tool-calling - Context length extended to 256k tokens - Improved integration with various agent scaffolds (e.g. Claude Code Roo Code etc) ๐ Weights & code: https://t.co/83sQekosr9 ๐ฌ Chat with new Kimi https://t.co/mkOuBMwzpw"
X Link 2025-09-05T03:26Z 31.4K followers, 18.6K engagements
"The amazing blogpost from @gordic_aleksa is alive at vLLM's blogpost (after more proofreading and clarifications) Looking forward to future series of tech deep dive blogposts๐ https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of understanding of the codebase and then to write up https://t.co/F2wsYaFO7q https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html New in-depth blog"
X Link 2025-09-09T01:31Z 31.3K followers, 48.2K engagements
"Wow thanks to @charles_irl you can understand internals of vLLM with a live notebook from @modal ๐ฅฐ I had already planned to spend the day reading @gordic_aleksa's "Inside vLLM" blog post. That turned out to be an incredible fit for @modal Notebooks released today https://t.co/QKX1g9smdp https://t.co/npjf7yrljM I had already planned to spend the day reading @gordic_aleksa's "Inside vLLM" blog post. That turned out to be an incredible fit for @modal Notebooks released today https://t.co/QKX1g9smdp https://t.co/npjf7yrljM"
X Link 2025-09-10T09:27Z 31.2K followers, 30.7K engagements
"โก Efficient weight updates for RL at trillion-parameter scale ๐ก Best practice from Kimi @Kimi_Moonshot vLLM is proud to collaborate with checkpoint-engine: Broadcast weight sync for 1T params in 20s across 1000s of GPUs Dynamic P2P updates for elastic clusters Optimized pipeline w/ overlapped H2D broadcast & reload Open source & ready for large-scale RL with vLLM ๐ Introducing checkpoint-engine: our open-source lightweight middleware for efficient in-place weight updates in LLM inference engines especially effective for RL. โ
Update a 1T model on thousands of GPUs in 20s โ
Supports both"
X Link 2025-09-10T17:06Z 31.2K followers, 24K engagements
"Thank you @cHHillee for the great explanation and demo of how to implement deterministic inference on vLLM https://github.com/vllm-project/vllm/pull/24583 Today Thinking Machines Lab is launching our research blog Connectionism. Our first blog post is Defeating Nondeterminism in LLM Inference We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to https://t.co/jMFL3xt67C https://github.com/vllm-project/vllm/pull/24583 Today Thinking Machines Lab is launching our research blog Connectionism. Our first blog post is"
X Link 2025-09-10T17:41Z 31.2K followers, 16.3K engagements
"Deep dive into optimizing weight transfer step by step and improving it 60x [---] seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT https://t.co/PAUqY43epH [---] seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT https://t.co/PAUqY43epH"
X Link 2025-09-10T20:46Z 31.3K followers, 18.5K engagements
"Welcome Qwen3-Next You can run it efficiently on vLLM with accelerated kernels and native memory management for hybrid models. https://blog.vllm.ai/2025/09/11/qwen3-next.html ๐ Introducing Qwen3-Next-80B-A3B the FUTURE of efficient LLMs is here ๐น 80B params but only 3B activated per token 10x cheaper training 10x faster inference than Qwen3-32B.(esp. @ 32K+ context) ๐นHybridArchitecture:GatedDeltaNet+GatedAttentionbestofspeed& https://t.co/yO7ug721U6 https://blog.vllm.ai/2025/09/11/qwen3-next.html ๐ Introducing Qwen3-Next-80B-A3B the FUTURE of efficient LLMs is here ๐น 80B params but only"
X Link 2025-09-11T19:38Z 31.4K followers, 61.7K engagements
"v0.10.2 marks the first release with official aarch64 support for vLLM You can now install vLLM directly onto @nvidia 's GB200. Along with the PyPI release our docker image is also multi-platform so pulling the right image just works. More perf enhancements on the way"
X Link 2025-09-16T00:49Z 31.3K followers, 17.9K engagements
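A short sketch of what official aarch64 support means in practice; the Docker tag is illustrative:

```bash
# On an aarch64 host such as GB200, the PyPI wheels now install directly.
pip install -U vllm

# The official image is multi-platform, so the same pull resolves to the right
# architecture on arm64 and x86_64 hosts (tag shown is illustrative).
docker pull vllm/vllm-openai:v0.10.2
```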
"Congrats to @deepseek_ai DeepSeek-R1 was published in Nature yesterday as the cover article and vLLM is proud to have supported its RL training and inference๐ฅฐ"
X Link 2025-09-18T02:44Z 31.2K followers, 214.3K engagements
"Pro Tip๐กFast and simple way to deploy DeepSeek-V3.1-Terminus with vLLM โก Run it with: vllm serve deepseek-ai/DeepSeek-V3.1-Terminus -tp [--] -dcp [--] (as simple as appending -dcp [--] after -tp 8) Thanks to the @Kimi_Moonshot team vLLM 0.10.2 adds Decode Context Parallel (DCP) support: โ
Cuts KV cache duplication by sharding across GPUs โ
[--] larger KV cache โ
[--] throughput gain on single-node H200 Perfect for KV-cache hungry tasks (RL offline data generation). More blogposts diving into DCP are coming soonand optimizations for general GQA models are on the way ๐ #vLLM #DeepSeek #AIInfra ๐"
X Link 2025-09-24T11:35Z 31.3K followers, 44K engagements
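The command from the post, written out; the -dcp value is redacted above, so the 8 used here mirrors the "-tp 8" in the parenthetical and is otherwise an assumption:

```bash
# Decode Context Parallel (DCP) shards the KV cache across GPUs at decode time.
# -tp = tensor parallel size, -dcp = decode context parallel size (vLLM 0.10.2+).
vllm serve deepseek-ai/DeepSeek-V3.1-Terminus -tp 8 -dcp 8
```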
"๐ New in vLLM: dots.ocr ๐ฅ A powerful multilingual OCR model from @xiaohongshu hi lab is now officially supported in vLLM ๐ Single end-to-end parser for text tables (HTML) formulas (LaTeX) and layouts (Markdown) ๐ Supports [---] languages with robust performance on low-resource docs โก Compact 1.7B VLM but achieves SOTA results on OmniDocBench & dots.ocr-bench โ
Free for commercial use Deploy it in just two steps: uv pip install vllm --extra-index-url vllm serve rednote-hilab/dots.ocr --trust-remote-code Try it today and bring fast accurate OCR to your pipelines. Which models would you like"
X Link 2025-09-28T12:20Z 31.4K followers, 70.6K engagements
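The two-step deployment from the post; the --extra-index-url argument is omitted there, so the sketch below assumes a plain PyPI install is sufficient on a recent release:

```bash
# Install vLLM, then serve the dots.ocr model (its custom code requires --trust-remote-code).
uv pip install -U vllm
vllm serve rednote-hilab/dots.ocr --trust-remote-code
```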
"How does @deepseek_ai Sparse Attention (DSA) work It has [--] components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of [---] per token (vs. [---] for MLA). It scores incoming queries. The top-2048 tokens to pass to Sparse MLA. ๐ Introducing DeepSeek-V3.2-Exp our latest experimental model โจ Built on V3.1-Terminus it debuts DeepSeek Sparse Attention(DSA) for faster more efficient training & inference on long context. ๐ Now live on App Web and API. ๐ฐ API prices cut by 50%+ 1/n ๐ Introducing DeepSeek-V3.2-Exp our latest experimental model โจ"
X Link 2025-09-29T10:59Z 31.2K followers, 103.3K engagements
"Getting ready to try DeepSeek-V3.2-Exp from @deepseek_ai vLLM is here to help We have verified that it works on H200 machines and many other hardwares thanks to the hardware plugin mechanism. Check out the recipes for more details ๐ Note: currently the PR is still pending but you can use our pre-compiled wheels to build directly and use the model We will push the model into main branch very soon and add many more optimizations. Stay tuned https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html How does @deepseek_ai Sparse Attention (DSA) work It has [--] components: the"
X Link 2025-09-29T14:05Z 31.2K followers, 12.5K engagements
"Keeping BERT alive in vLLM via transformers Today I can finally announce that the Transformers backend for vLLM will be official way to use encoder-only (think BERT et al.) models in vLLM going forward (1/2) https://t.co/uIlGy2loCE Today I can finally announce that the Transformers backend for vLLM will be official way to use encoder-only (think BERT et al.) models in vLLM going forward (1/2) https://t.co/uIlGy2loCE"
X Link 2025-10-02T17:40Z 31.3K followers, 17.8K engagements
"๐ The RL community keeps pushing boundaries from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild but thats exactly what PipelineRL makes work. vLLM is proud to power this kind of modular cutting-edge RL innovation. Give it a try and share your thoughts I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences Just update the weights and"
X Link 2025-10-05T07:04Z 31.4K followers, 89K engagements
"๐ vLLM x MinerU: Document Parsing at Lightning Speed Were excited to see MinerU fully powered by vLLM bringing ultra-fast accurate and efficient document understanding to everyone. โก Powered by vLLMs high-throughput inference engine MinerU [---] delivers: Instant parsing no waiting Deeper understanding for complex docs Optimized cost even consumer GPUs can fly Experience the new speed of intelligence: ๐ #vLLM #MinerU #AI #LLM #DocumentParsing #AIresearch https://github.com/opendatalab/MinerU MinerU [---] has arrived with demo on @huggingface https://t.co/KZyL6TJDSe"
X Link 2025-10-11T11:11Z 31.2K followers, 71.7K engagements
"๐ vLLM just hit 60K GitHub stars ๐ From a small research idea to powering LLM inference everywhere across NVIDIA AMD Intel Apple TPUs and more vLLM now supports almost all major text-generation models and native RL pipelines like TRL Unsloth Verl and OpenRLHF. Huge thanks to our amazing community and friends at @PyTorch @huggingface Transformers and model vendors from @AIatMeta Llama to @OpenAI GPT-OSS @Alibaba_Qwen Qwen @deepseek_ai DeepSeek and @Kimi_Moonshot Kimi and many others (sorry we ran out of space) for making this ecosystem thrive. โค Let's head to the next chapter of efficient"
X Link 2025-10-13T13:13Z 31.2K followers, 39.1K engagements
"Announcing the completely reimagined vLLM TPU In collaboration with @Google we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. ๐ What's New - JAX + Pytorch: Run PyTorch models on TPUs with no code changes now with native JAX support. - Up to 5x Performance: Achieve nearly 2x-5x higher throughput compared to the first TPU prototype. - Ragged Paged Attention v3: A more flexible and performant attention kernel for TPUs. - SPMD Native: We've shifted to Single Program Multi-Data (SPMD) as the default a"
X Link 2025-10-16T16:08Z 31.2K followers, 157.8K engagements
"kvcached works directly with vLLM and you can use it to serve multiple models on the same GPU. They will share unused KV cache blocks. Check it out ๐ End the GPU Cost Crisis Today Headache with LLMs lock a whole GPU but leave capacity idle Frustrated by your cluster's low utilization We launch kvcached the first library for elastic GPU sharing across LLMs. ๐ https://t.co/3BC7B6s2EX ๐งต๐ Why it matters: https://t.co/jdIg1gyyOS ๐ End the GPU Cost Crisis Today Headache with LLMs lock a whole GPU but leave capacity idle Frustrated by your cluster's low utilization We launch kvcached the first"
X Link 2025-10-21T23:22Z 31.4K followers, 55.7K engagements
"its tokenization again ๐คฏ did you know tokenize(detokenize(token_ids)) token_ids RL researchers from Agent Lightning coined the term Retokenization Drift a subtle mismatch between what your model generated and what your trainer thinks it generated. why because most agents call LLMs via OpenAI-compatible APIs that only return strings so when those strings get retokenized later token splits may differ (HAV+ING vs H+AVING) tool-call JSON may be reformatted or chat templates may vary. unstable learning off-policy updates training chaos. ๐ฌ (@karpathy has a great video explaining all details about"
X Link 2025-10-22T15:17Z 31.2K followers, 171K engagements
"๐ Excited to share our work on batch-invariant inference in vLLM Now you can get identical results regardless of batch size with just one flag: VLLM_BATCH_INVARIANT=1 No more subtle differences between bs=1 and bs=N (including prefill). Let's dive into how we built this ๐งต๐"
X Link 2025-10-22T20:02Z 31.2K followers, 38.8K engagements
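A minimal sketch of turning the feature on; the model is illustrative, and the environment variable is the one named in the post:

```bash
# Batch-invariant inference: outputs are identical whether a request is batched
# with others or runs alone (including prefill), at some throughput cost.
VLLM_BATCH_INVARIANT=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```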
"๐@Kimi_Moonshot co-founder @ppwwyyxx talking about Moonshots Decode Context Parallel open source contribution to @vllm_project at @PyTorch conf"
X Link 2025-10-24T00:20Z 31.3K followers, 17.1K engagements
"vLLM ๐ค @nvidia = open scalable agentic AI you can run anywhere. ๐งต Strengthening our partnership with @nvidia: vLLM serves the NVIDIA Nemotron family. This new blog from NVIDIA walks through how to deploy open high-accuracy agentic inference across data center and edgefast reproducible and production-ready. ๐ Model highlight Nemotron Nano [--] (9B): a small language reasoning model with a hybrid TransformerMamba design and a tunable thinking budget. Open weights + 9T tokens of open data on Hugging Face (permissive license). Excels at reasoning/coding instruction following tool calling and"
X Link 2025-10-24T02:50Z 31.2K followers, 19.2K engagements