@vllm_project vLLM

vLLM posts on X most often about inference, llm, ai, and agentic topics. The account currently has [------] followers and [---] posts still getting attention, totaling [-----] engagements in the last [--] hours.

Engagements: [-----]

Engagements Line Chart

Mentions: [--]

Mentions Line Chart

Followers: [------]

Followers Line Chart

CreatorRank: [-------]

CreatorRank Line Chart

Social Influence

Social category influence: technology brands 28.44%, stocks 22.02%, travel destinations 2.75%, finance 2.75%, countries 1.83%, social networks 0.92%

Social topic influence: inference #56, llm #829, ai 11.01%, agentic 11.01%, the first 7.34%, native #781, community 6.42%, strong 6.42%, up to 5.5%, performance 5.5%

Top accounts mentioned or mentioned by @kimimoonshot @pytorch @vllmproject @deepseekai @nvidia @alibabaqwen @ai_bridge_japan @nvidiaaidev @huggingface @minimaxai @redhat_ai @aiatmeta @redhatai @mistralai @aiatamd @sparkycollier @anushelangovan @dtransposed @alibaba_qwen @bygregorr

Top Social Posts

Top posts by engagements in the last [--] hours

"๐Ÿš€ DeepSeek-OCR the new frontier of OCR from @deepseek_ai exploring optical context compression for LLMs is running blazingly fast on vLLM โšก (2500 tokens/s on A100-40G) powered by vllm==0.8.5 for day-0 model support. ๐Ÿง  Compresses visual contexts up to [--] while keeping 97% OCR accuracy at [--]. ๐Ÿ“„ Outperforms GOT-OCR2.0 & MinerU2.0 on OmniDocBench using fewer vision tokens. ๐Ÿค The vLLM team is working with DeepSeek to bring official DeepSeek-OCR support into the next vLLM release making multimodal inference even faster and easier to scale. ๐Ÿ”— #vLLM #DeepSeek #OCR #LLM #VisionAI #DeepLearning"
X Link 2025-10-20T11:31Z 31.6K followers, 1.5M engagements

"Just landed FlashMLA in vLLM and it is already boosting output throughput 2-16% We expect more improvements in the coming days as we continue to optimize the code path. https://github.com/vllm-project/vllm/pull/13747 ๐Ÿš€ Day [--] of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs optimized for variable-length sequences and now in production. โœ… BF16 support โœ… Paged KV cache (block size 64) โšก [----] GB/s memory-bound & [---] TFLOPS https://github.com/vllm-project/vllm/pull/13747 ๐Ÿš€ Day [--] of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our"
X Link 2025-02-27T06:15Z 31.4K followers, 49.5K engagements

"๐Ÿš€ Last week we hosted the vLLM Beijing Meetup with top players like Tencent @TencentHunyuan Huawei @Huawei ByteDance @ByteDanceOSS Ant Group @AntGroup Moonshot AI @Kimi_Moonshot & Xiaomi @XiaoMi_AI ๐Ÿ’ก Discover how industry leaders are using vLLM to power real-world high-performance inference systems why they choose vLLM and how they improve upon vLLM: ๐Ÿ“– WeChat Post ๐Ÿ“บ Livestream Replay ๐Ÿ“‘ Slides #vLLM #AIInference #LLMInfra https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF https://www.chaspark.com/#/live/1166916873711665152"
X Link 2025-08-07T14:31Z 31.6K followers, [----] engagements

"Have you ever felt you are developing cuda kernels and your tests often run into illegal memory access (IMA for short) and you have no idea how to debug We have collaborated with the @nvidia team to investigate how cuda core dump can help check out the blogpost to learn more https://blog.vllm.ai/2025/08/11/cuda-debugging.html https://blog.vllm.ai/2025/08/11/cuda-debugging.html"
X Link 2025-08-13T03:55Z 31.4K followers, 34K engagements

"Amazing blogpost from @gordic_aleksa explaining internals of vLLM๐Ÿ˜ New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of understanding of the codebase and then to write up https://t.co/F2wsYaFO7q New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of"
X Link 2025-09-01T16:06Z 31.4K followers, 32.8K engagements

"๐Ÿš€ New in vLLM: dots.ocr ๐Ÿ”ฅ A powerful multilingual OCR model from @xiaohongshu hi lab is now officially supported in vLLM ๐Ÿ“ Single end-to-end parser for text tables (HTML) formulas (LaTeX) and layouts (Markdown) ๐ŸŒ Supports [---] languages with robust performance on low-resource docs โšก Compact 1.7B VLM but achieves SOTA results on OmniDocBench & dots.ocr-bench โœ… Free for commercial use Deploy it in just two steps: uv pip install vllm --extra-index-url vllm serve rednote-hilab/dots.ocr --trust-remote-code Try it today and bring fast accurate OCR to your pipelines. Which models would you like"
X Link 2025-09-28T12:20Z 31.4K followers, 70.6K engagements

"๐Ÿš€ The RL community keeps pushing boundaries from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild but thats exactly what PipelineRL makes work. vLLM is proud to power this kind of modular cutting-edge RL innovation. Give it a try and share your thoughts I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences Just update the weights and"
X Link 2025-10-05T07:04Z 31.4K followers, 89K engagements

"๐ŸŽ‰ Congrats to @Kimi_Moonshot vLLM Day-0 model expands Now supporting Kimi Linear hybrid linear attention with Kimi Delta Attention(KDA): - RULER 128k context: [----] perf + [----] speedup - Up to [--] faster decoding & [---] faster TPOT (1M tokens) - 75% KV cache reduction ๐Ÿ’ก Optimized for high-performance long-context LLM serving Fast Deployment in vLLM๐Ÿ‘‡ Kimi Linear Tech Report is dropped ๐Ÿš€ https://t.co/LwNB2sQnzM Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performanceready to serve as a drop-in replacement for full attention featuring our"
X Link 2025-10-30T16:58Z 31.4K followers, 39.8K engagements

"๐ŸŽ‰ Congrats @Alibaba_Qwen on the Qwen3-ASR release vLLM has day-0 support. [--] languages 2000x throughput on the 0.6B model singing voice recognition and SOTA accuracy on the 1.7B. Serve it now in vLLM ๐Ÿš€ Learn more: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-ASR.html Qwen3-ASR and Qwen3-ForcedAligner are now open source production-ready speech models designed for messy real-world audio with competitive performance and strong robustness. 52languages&dialectswithautolanguageID(30languages+22dialects/accents) Robustin https://t.co/q7RWjJFXgH"
X Link 2026-01-29T13:25Z 31.3K followers, 49K engagements

"Quick vLLM Tip ๐Ÿ’Ž Decode Context Parallel Using -tp but KV cache duplicated across GPUs Use -dcp size to shard KV cache along the token dimension. Core principles: [--]. TP shards KV cache along kv-heads (H dimension) [--]. When tp_size H KV cache gets duplicated tp_size/H times [--]. DCP shards along tokens (T dimension) reducing duplication [--]. dcp_size range: [--] tp_size/H no extra GPUs needed [--]. Interleaving strategy: future tokens naturally sharded Trade-off: larger dcp = less duplication more communication. Works with both MLA and GQA models. Docs:"
X Link 2026-01-31T16:25Z 31.6K followers, [----] engagements
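
For anyone who wants to try the flags from this tip, here is a minimal serving sketch. The model name and parallel sizes are examples for illustration, not values taken from the post; choose a dcp_size in the range the tip describes (up to tp_size/H for your model).

  # Shard the KV cache along the token dimension on top of 8-way tensor parallelism (illustrative sizes).
  vllm serve deepseek-ai/DeepSeek-V3.1-Terminus -tp 8 -dcp 8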

"๐ŸŽ‰ vLLM-Omni v0.14.0 is officially released our first stable release [---] commits from 70+ contributors (23 new) ship the multimodal stack for production. Highlights: โšก Async chunk pipeline overlap ๐Ÿ—ฃ Qwen3-TTS with online serving ๐ŸŽจ Diffusion LoRA (PEFT-compatible) ๐Ÿง  DiT layerwise CPU offloading ๐Ÿ”Œ XPU / ROCm / NPU backends New models: ๐Ÿฅฏ Bagel (multi-stage pipeline) ๐ŸŽต Stable Audio Open ๐Ÿ–ผ GLM-Image FLUX.1-dev FLUX.2-klein APIs: ๐Ÿ“ธ /v1/images/edit endpoint ๐Ÿฉบ /health & /v1/models for diffusion mode Performance: โšก Torch compile for diffusion ๐Ÿ”ฅ SharedFusedMoE for Qwen3-Omni ๐ŸงŠ TeaCache for"
X Link 2026-02-01T10:56Z 31.2K followers, 25.4K engagements

"๐ŸŽ‰ Congrats to @Alibaba_Qwen on releasing Qwen3-Coder-Next and day-0 support is ready in vLLM 0.15.0 An 80B MoE with only 3B active params matching models 1020x larger. Built for coding agents and local development. Verified on NVIDIA GPUs. Recipe below ๐Ÿ‘‡ ๐Ÿš€ IntroducingQwen3-Coder-Next an open-weight LM built for coding agents & local development. Whats new: ๐Ÿค– Scaling agentic training:800K verifiable tasks + executable envs ๐Ÿ“ˆ EfficiencyPerformance Tradeoff: achieves strong results on SWE-Bench Pro with 80B total params and https://t.co/P7BmZwdaQ9 ๐Ÿš€ IntroducingQwen3-Coder-Next an"
X Link 2026-02-03T17:44Z 31.4K followers, 28.9K engagements

"๐Ÿ“ˆ vLLM community + @nvidia pushed gpt-oss-120b performance on Blackwell GPUs to new heights: โšก +38% max throughput ๐ŸŽฏ +13% min latency ๐Ÿ“ˆ Entire Pareto frontier improved Key ingredients: FlashInfer integration torch.compile kernel fusions async scheduling and stream interval optimizations. Deep dive + deployment recipes: Thanks to the teams at @NVIDIAAI @RedHat_AI @AIatMeta and the vLLM community for the collaboration ๐Ÿ™ https://blog.vllm.ai/2026/02/01/gpt-oss-optimizations.html https://blog.vllm.ai/2026/02/01/gpt-oss-optimizations.html"
X Link 2026-02-04T01:29Z 31.4K followers, 26K engagements

"30M downloads for NVIDIA Nemotron on Hugging Face. Huge milestone. ๐ŸŽ‰ Really appreciate the ongoing @NVIDIAAIDev x vLLM collaboration helping Nemotron run fast and serve developers efficiently at scale. 30M downloads and counting for the NVIDIA Nemotron family on @huggingface ๐Ÿค— We're grateful for the incredible community that has made this possible. Get started with Nemotron: https://t.co/4AtDcnOCXS https://t.co/cHRF3qbfVi 30M downloads and counting for the NVIDIA Nemotron family on @huggingface ๐Ÿค— We're grateful for the incredible community that has made this possible. Get started with"
X Link 2026-02-04T02:38Z 31.2K followers, [----] engagements

"๐ŸŽ‰ Congrats to @Alibaba_Qwen on releasing Qwen3.5 on Chinese New Year's Eve day-0 support is ready in vLLM Qwen3.5 is a multimodal MoE with hybrid Mamba-attention architecture 397B total params only 17B active. What makes it interesting for inference: ๐Ÿง  Gated Delta Networks + sparse MoE high throughput low latency lower cost ๐ŸŒ [---] languages and dialects supported out of the box ๐Ÿ‘ One model for both text and vision no separate VL pipeline needed Verified on NVIDIA GPUs. Recipes for Docker pip and multi-node deployment ๐Ÿ‘‡ #vLLM #Qwen #OpenSource #Inference"
X Link 2026-02-16T09:47Z 31.6K followers, [----] engagements

"๐ŸŽ‰ Congrats to @intern_lm on Intern-S1-Pro day-0 support in vLLM ๐Ÿ”ฌ Trillion-scale MoE for scientific reasoning: 1T total params [---] experts 22B activated per token. State-of-the-art across AI4Science domains. PR: Serving command (โœ… Verified on NVIDIA GPUs): https://github.com/vllm-project/vllm/pull/33636 ๐Ÿš€Introducing Intern-S1-Pro an advanced 1T MoE open-source multimodal scientific reasoning model. 1SOTA scientific reasoning competitive with leading closed-source models across AI4Science tasks. 2Top-tier performance on advanced reasoning benchmarks strong general https://t.co/cKni28WwQT"
X Link 2026-02-04T13:49Z 31.6K followers, [----] engagements

"๐Ÿš€๐Ÿš€๐Ÿš€ vLLM on NVIDIA GB200: 26.2K prefill TPGS 10.1K decode TPGS for DeepSeek R1/V3. ๐Ÿ“ˆ 3-5x throughput vs H200 - with half the GPUs Key optimizations: - NVFP4 GEMM for MoE experts - FP8 GEMM for MLA - Kernel fusion (RoPE+Quant+Q Write) - Weight offloading v2 with async prefetch Thanks to the @AIatMeta and @NVIDIAAIDev teams for the collaboration ๐Ÿ™ ๐Ÿ”— Blog: https://blog.vllm.ai/2026/02/03/dsr1-gb200.html https://blog.vllm.ai/2026/02/03/dsr1-gb200.html"
X Link 2026-02-04T17:48Z 31.6K followers, 37.3K engagements

"Congrats to @MistralAI on releasing Voxtral Mini 4B Realtime ๐ŸŽ‰ Day-0 support in vLLM A 4B streaming ASR model achieving 500ms latency while matching offline model accuracy supporting [--] languages. vLLM's new Realtime API /v1/realtime provides audio streaming - optimized for voice assistants live subtitles and meeting transcription Thanks to the close collaboration between the vLLM community and @MistralAI for making this production-grade support possible ๐Ÿค ๐Ÿ“‘Model & Usage: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602 Voxtral Realtime is built for voice agents and live"
X Link 2026-02-04T17:51Z 31.6K followers, 30.3K engagements

"๐Ÿš€The vLLM-Omni paper is now on arXiv ๐ŸŽ‰ Congrats to the team on documenting the system design for serving any-to-any multimodal models stage-based decomposition for complex pipelines (LLMs + diffusion + encoders) per-stage batching and flexible GPU allocation. Repo: https://github.com/vllm-project/vllm-omni https://arxiv.org/abs/2602.02204 https://github.com/vllm-project/vllm-omni https://arxiv.org/abs/2602.02204"
X Link 2026-02-05T03:36Z 31.6K followers, 16.5K engagements

"Great writeup from @AI21Labs on scaling vLLM for high-throughput bursty workloads. TL;DR: systematic config tuning + queue-based autoscaling = 2x throughput from the same GPUs. ๐Ÿš€ Useful for anyone running vLLM in production with variable traffic patterns. Thanks to the @AI21Labs team for publishing the full engineering writeup. ๐Ÿ’™ ๐Ÿ”— https://www.ai21.com/blog/scaling-vllm-without-oom/ 1/5 Go Big or Go OOM: The Art of Scaling vLLM ๐ŸŽฏ. We doubled throughput and cut latency in half-same GPUs just better vLLM config then added smart autoscaling to handle traffic bursts. Here's what we learned"
X Link 2026-02-10T12:17Z 31.6K followers, 14.9K engagements

"Part [--] of our AI21 Labs x vLLM series ๐Ÿ‘‡ After scaling insights this one goes deep into debugging: how a 1-in-1000 gibberish failure in vLLM + Mamba was reproduced traced and fixed upstream. Key lesson: even with correct kernel math request classification timing can still corrupt state under memory pressure. Thanks again to the @AI21Labs team for sharing the full investigation and fix details. ๐Ÿ”— https://www.ai21.com/blog/vllm-debugging-mamba-bug/ https://www.ai21.com/blog/vllm-debugging-mamba-bug/"
X Link 2026-02-10T12:57Z 31.6K followers, [----] engagements

"โšกStreaming input + ๐ŸŽ™Realtime WebSocket API built in collaboration with @Meta and @MistralAI. First among all popular open-source LLM inference engines. Check out the design and examples in http://blog.vllm.ai/2026/01/31/streaming-realtime.html http://blog.vllm.ai/2026/01/31/streaming-realtime.html"
X Link 2026-02-11T01:27Z 31.6K followers, 11.5K engagements

"๐Ÿš€ vLLM just hit 70K GitHub stars ๐ŸŽ‰ The engine has kept evolving fast since the last milestone. We've been pushing hard on large-scale serving production-grade multi-node support on NVIDIA Blackwell with WideEP and expert parallelism making it practical to serve the biggest models at scale. More models more hardware async scheduling for higher throughput real-time streaming for speech and audio and a growing multimodal story across text vision video and voice. Huge thanks to our sponsors our 2100+ contributors friends at @PyTorch @huggingface Transformers and the model labs we work closely"
X Link 2026-02-11T08:35Z 31.6K followers, [----] engagements

"๐Ÿ”ฅCongrats to @Zai_org on launching GLM-5 744B parameters (40B active) trained on 28.5T tokens integrating DeepSeek Sparse Attention to keep deployment cost manageable while preserving long-context capacity. vLLM has day-0 support for GLM-5-FP8 with: ๐Ÿ“– DeepSeek Sparse Attention for efficient long-context serving โšก MTP speculative decoding โš™ Tool calling + thinking mode Recipe with serving configs and benchmarks: ๐Ÿ”— https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html"
X Link 2026-02-11T18:44Z 31.6K followers, 41.9K engagements

"Exciting evolution for the Pytorch Foundation family ๐ŸŽ‰ Congrats to @matthew_d_white on stepping into the CTO role at PyTorch Foundation and welcome @sparkycollier as the new Executive Director As the foundation scales to support @PyTorch @vllm_project @DeepSpeedAI and @raydistributed under one umbrella stronger leadership means stronger open source AI infrastructure for everyone. Looking forward to what's ahead ๐Ÿ™Œ Over the past two years PyTorch Foundation has evolved into an umbrella foundation hosting PyTorch @vllm_project @DeepSpeedAI & @raydistributed with expanded governance global"
X Link 2026-02-12T12:26Z 31.6K followers, [----] engagements

"๐Ÿš€ DeepSeek R1 on GB300 with vLLM: 22.5K prefill TGS and 3K decode TGS per GPU an 8x prefill and 10-20x mixed-context improvement over Hopper. DeepSeek V3.2 on [--] GPUs (NVFP4 + TP2): 7.4K prefill TGS and 2.8K decode TGS. Key recipe: โšก NVFP4 weights from HuggingFace โšก FlashInfer FP4 MoE kernel (VLLM_USE_FLASHINFER_MOE_FP4=1) โšก [--] GPUs is all you need 288GB HBM per GPU EP vs TP MTP tuning disaggregated prefill all covered. ๐Ÿ™ Collab with @daocloud_io & @verdacloud ๐Ÿ”— Blog: https://blog.vllm.ai/2026/02/13/gb300-deepseek.html https://blog.vllm.ai/2026/02/13/gb300-deepseek.html"
X Link 2026-02-13T13:56Z 31.6K followers, 11.9K engagements

"๐ŸŽ‰ Congrats @MiniMax_AI on M2.5 vLLM has day-0 support SOTA coding (80.2% SWE-Bench Verified) agentic search (76.3% BrowseComp) trained on 200k+ real-world RL environments. 37% faster than M2.1 matching Opus [---] speed. ๐Ÿš€ โœ…Verified on NVIDIA GPUs. Recipe (Docker & pip): https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html MiniMax-M2.5 is now open source. Trained with reinforcement learning across hundreds of thousands of complex real-world environments it delivers SOTA performance in coding agentic tool use search and office workflows. Hugging Face: https://t.co/Wxksq9BB7t"
X Link 2026-02-13T14:06Z 31.6K followers, 23.2K engagements

"๐Ÿ”ฅExcited to see SkyRL bringing Tinker to local GPUs. Standardizing training APIs lowers the barrier for research and infrastructure innovation. vLLM is proud to power the inference layer behind high-throughput RL training. ๐Ÿš€ SkyRL now implements the Tinker API. Now training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2 Megatron and vLLM backends. Blog: https://t.co/GAtW81jM38 ๐Ÿงต https://t.co/shLZSjdi5x SkyRL now implements the Tinker API. Now training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's"
X Link 2026-02-14T07:38Z 31.6K followers, [----] engagements

"vLLM is coming to Hong Kong join us on March [--] for a full-day meetup on LLM inference multimodal serving and multi-hardware optimization. Talks & workshops from @vllm_project core team @RedHat_AI @AIatAMD MetaX and @MiniMax_AI and more. ๐Ÿ”— Register: https://www.vantagemind.com/events/vLLM/260307/vLLM-HK-Meetup_vLLM.html https://www.vantagemind.com/events/vLLM/260307/vLLM-HK-Meetup_vLLM.html"
X Link 2026-02-14T09:15Z 31.6K followers, [----] engagements

"๐ŸŽ‰ Congrats to @Alibaba_Qwen on releasing Qwen3.5 on Chinese New Year's Eve day-0 support is ready in vLLM Qwen3.5 is a multimodal MoE with Gated Delta Networks architecture 397B total params only 17B active. What makes it interesting for inference: ๐Ÿง  Gated Delta Networks + sparse MoE high throughput low latency lower cost ๐ŸŒ [---] languages and dialects supported out of the box ๐Ÿ‘ One model for both text and vision no separate VL pipeline needed Verified on NVIDIA GPUs. Recipes for Docker pip and multi-node deployment ๐Ÿ‘‡ #vLLM #Qwen #OpenSource #Inference"
X Link 2026-02-16T10:17Z 31.6K followers, 45.4K engagements

"Join us on March [--] for an evening of deep technical talks on inference IDE integrations omni-modality and Kubernetes-scale serving. ๐Ÿ“… March [--] 5:00 PM - 9:00 PM (GMT+1) Warsaw Poland See you in Warsaw ๐Ÿ‘‹ Warsaw friends vLLM Inference Meetup is happening March [--]. @jetbrains @RedHat and @nvidia are bringing @vllm_project maintainers + real demos plus an optional hands-on workshop earlier in the day. If you build or run inference come meet the people shipping it. ๐Ÿ‘‹ Warsaw friends vLLM Inference Meetup is happening March [--]. @jetbrains @RedHat and @nvidia are bringing @vllm_project"
X Link 2026-02-18T02:20Z 31.6K followers, [----] engagements

"Great work We love how @vllm_project is used in the rollout process with with offloading the engine to CPU and give the GPU back to the kernel to be benchmarked This is a small feature we implemented to make RLHF smoother with vLLM. Our research interns present: Kevin-32B = K(ernel D)evin It's the first open model trained using RL for writing CUDA kernels. We implemented multi-turn RL using GRPO (based on QwQ-32B) on the KernelBench dataset. It outperforms top reasoning models (o3 & o4-mini) ๐Ÿงต https://t.co/I3UXLGKFNb Our research interns present: Kevin-32B = K(ernel D)evin It's the first"
X Link 2025-05-07T01:45Z 31.3K followers, 12.9K engagements

""inference uses @vllm_project" Releasing INTELLECT-2: Were open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: Detailed Technical Report INTELLECT-2 model checkpoint https://t.co/iHDDHRyKN2 Releasing INTELLECT-2: Were open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: Detailed Technical Report INTELLECT-2 model checkpoint https://t.co/iHDDHRyKN2"
X Link 2025-05-12T05:39Z 31.2K followers, 10.7K engagements

"uv pip install -U vLLM The latest release features [---] commits from [---] contributors. vLLM is now ready for @nvidia Blackwell with the latest @PyTorch [---] upgrade. Huge thanks to @NVIDIAAIDev and @ye_combinator for the CUTLASS and FlashInfer kernels"
X Link 2025-05-28T01:54Z 31.2K followers, 17.7K engagements

"Congrats on the launch vLLM is proud to support the new Qwen3 embedding models check it out ๐Ÿ‘‰๐Ÿป https://github.com/QwenLM/Qwen3-Embeddingtab=readme-ov-file#vllm-usage ๐Ÿš€ Proud to introduce the Qwen3-Embedding and Qwen3-Reranker Series setting new standards in multilingual text embedding and relevance ranking โœจ Highlights: โœ… Available in 0.6B / 4B / 8B versions โœ… Supports [---] languages โœ… State-of-the-Art performance on MMTEB MTEB https://t.co/qNu0rswSol https://github.com/QwenLM/Qwen3-Embeddingtab=readme-ov-file#vllm-usage ๐Ÿš€ Proud to introduce the Qwen3-Embedding and Qwen3-Reranker Series"
X Link 2025-06-05T15:45Z 31.3K followers, 12.2K engagements

"uv pip install -U vllm --extra-index-url --torch-backend=auto Try out Magistral on with vLLM 0.9.1rc1 today ๐Ÿ”ฎ https://wheels.vllm.ai/0.9.1rc1 Announcing Magistral our first reasoning model designed to excel in domain-specific transparent and multilingual reasoning. https://t.co/SwKEEtCIXh https://wheels.vllm.ai/0.9.1rc1 Announcing Magistral our first reasoning model designed to excel in domain-specific transparent and multilingual reasoning. https://t.co/SwKEEtCIXh"
X Link 2025-06-10T14:37Z 31.2K followers, 11.7K engagements

"๐Ÿ‘€ Look what just arrived at @BerkeleySky ๐ŸŒŸ A shiny MI355X system. Huge thanks to @AMD for supporting open source and we are looking forward to getting it set up in the next few days"
X Link 2025-06-10T17:20Z 31.3K followers, 20.7K engagements

"Thank you @AMD @LisaSu @AnushElangovan for Advancing AI together with @vllm_project We look forward to the continued partnership and pushing the boundary of inference"
X Link 2025-06-13T23:54Z 31.2K followers, 17.3K engagements

"Congrats on the launch vLLM is proud to support this great model on day [--] looking forward to the following releases Day 1/5 of #MiniMaxWeek: Were open-sourcing MiniMax-M1 our latest LLM setting new standards in long-context reasoning. - Worlds longest context window: 1M-token input 80k-token output - State-of-the-art agentic use among open-source models - RL at unmatched efficiency: https://t.co/bGfDlZA54n Day 1/5 of #MiniMaxWeek: Were open-sourcing MiniMax-M1 our latest LLM setting new standards in long-context reasoning. - Worlds longest context window: 1M-token input 80k-token output -"
X Link 2025-06-16T16:14Z 31.3K followers, 19K engagements

"vLLM has just reached 50K github stars Huge thanks to the community๐Ÿš€ Together let's bring easy fast and cheap LLM serving for everyoneโœŒ๐Ÿป"
X Link 2025-06-19T05:25Z 31.2K followers, 18.3K engagements

""The second most intelligent open weights model after DeepSeek R1 with a much longer 1M token context window" Checkout the blog post from @minimax_ai on how the model is implemented on vLLM and how you can run this model efficiently https://blog.vllm.ai/2025/06/30/minimax-m1.html MiniMax launches their first reasoning model: MiniMax M1 the second most intelligent open weights model after DeepSeek R1 with a much longer 1M token context window @minimax_ai M1 is based on their Text-01 model (released [--] Jan 2025) - an MoE with 456B total and 45.9B active https://t.co/JltMYrm0te"
X Link 2025-07-01T04:16Z 31.2K followers, 13.9K engagements

"We genuinely want to solve this problem. As many (@Rxday000 @samsja19 @danielhanchen @_EldarKurtic and more) chimed in the reason includes attention kernels matmul reduction order precisions in various operators and more horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh"
X Link 2025-07-06T21:30Z 31.2K followers, 37.3K engagements

"vLLM runs on free-threaded Python A group of engineers from @Metas Python runtime language team has shown that its possible to run vLLM on the nogil distribution of Python. Were incredibly excited to embrace this future technique and be early adopters ๐Ÿ˜"
X Link 2025-07-08T05:06Z 31.2K followers, 50.6K engagements

"@Kimi_Moonshot just released a trillion-parameter model with great agentic capability and it is already supported in vLLM Have a try with a simple command and check the doc for more advanced deployment๐Ÿš€ ๐Ÿš€ Hello Kimi K2 Open-Source Agentic Model ๐Ÿ”น 1T total / 32B active MoE model ๐Ÿ”น SOTA on SWE Bench Verified Tau2 & AceBench among open models ๐Ÿ”นStrong in coding and agentic tasks ๐Ÿค Multimodal & thought-mode not supported for now With Kimi K2 advanced agentic intelligence https://t.co/PlRQNrg9JL ๐Ÿš€ Hello Kimi K2 Open-Source Agentic Model ๐Ÿ”น 1T total / 32B active MoE model ๐Ÿ”น SOTA on SWE Bench"
X Link 2025-07-11T15:06Z 31.4K followers, 34.4K engagements

"Thanks for the great write-up ๐Ÿ™Œ Prefix caching is critical for agentic workflows like @manusai and vLLM makes it seamless. โœ… prefix caching is enabled by default with an efficient implementation โœ… Append-only context Cache hit heaven Context engineering FTW ๐Ÿš€ After four overhauls and millions of real-world sessions here are the lessons we learned about context engineering for AI agents: https://t.co/Ql014rEzBQ After four overhauls and millions of real-world sessions here are the lessons we learned about context engineering for AI agents: https://t.co/Ql014rEzBQ"
X Link 2025-07-19T14:20Z 31.2K followers, 15K engagements

"The @huggingface Transformers @vllm_project integration just leveled up: Vision-Language Models are now supported out of the box If the model is integrated into Transformers you can now run it directly with vLLM. Great work @RTurganbay ๐Ÿ‘ https://github.com/vllm-project/vllm/pull/20543 https://github.com/vllm-project/vllm/pull/20543"
X Link 2025-07-22T20:32Z 31.2K followers, 22.3K engagements

"โœ… Try out @Alibaba_Qwen [--] Coder on vLLM nightly with "qwen3_coder" tool call parser Additionally vLLM offers expert parallelism so you can run this model in flexible configurations where it fits. Qwen3-Coder is here โœ… Were releasing Qwen3-Coder-480B-A35B-Instruct our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves https://t.co/Z8HfyrVScE Qwen3-Coder is here โœ… Were releasing Qwen3-Coder-480B-A35B-Instruct our most powerful open agentic code model to"
X Link 2025-07-22T22:06Z 31.2K followers, 34.2K engagements

"This amazing Attention-FFN disaggregation implementation from @StepFun_ai achieves decoding throughput of up to [----] tokens per second per GPU under 50ms TPOT SLA for their 321B-A38B MoE model Step3 served with H800 The implementation is based on vLLM and we are working together to bring it to the public Kudos to @StepFun_ai ๐Ÿš€ Check out their tech report at . https://github.com/stepfun-ai/Step3 https://github.com/stepfun-ai/Step3"
X Link 2025-07-26T01:26Z 31.2K followers, 33.6K engagements

"๐Ÿ”ฅ vLLM @ PyTorch Conference [----] ๐Ÿ”ฅ Were excited to share that [--] talks at this years PyTorch Conference will feature vLLM Topics include: Easy & Fast LLM Serving Open-Source Post-Training Stack Scaling Online LLM Training AMD GPU support via Triton vllm-triton backend performance Stay tuned & come say hi ๐Ÿš€ #vLLM #PyTorch #LLM #AI #opensource ๐Ÿ‘€ ICYMI: The #PyTorchConf schedule is now live https://t.co/YSAdVaiWRk Were talking [--] days of cutting-edge talks on #LLM scaling real-time inference model optimization & more straight from the #PyTorch community. ๐Ÿ“ Oct [----] San Francisco ๐ŸŽŸ Register:"
X Link 2025-07-31T07:31Z 31.2K followers, 29.1K engagements

"Thank you @OpenAI for open-sourcing these great models ๐Ÿ™Œ Were proud to be the official launch partner for gpt-oss (20B & 120B) now supported in vLLM ๐ŸŽ‰ โšก MXFP4 quant = fast & efficient ๐ŸŒ€ Hybrid attention (sliding + full) ๐Ÿค– Strong agentic abilities ๐Ÿš€ Easy deployment ๐Ÿ‘‰๐Ÿป Check out the blog and recipes for more details ๐Ÿ”ฅ #vLLM #gptOSS #openai https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html https://blog.vllm.ai/2025/08/05/gpt-oss.html Our open models are here. Both of them. https://t.co/9tFxefOXcg https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html"
X Link 2025-08-05T17:31Z 31.2K followers, 44.3K engagements

"๐Ÿ‘€ we care a lot about correctness ran many evals and stared at many tensors to compare them. numerics of vLLM on hopper should be solid and verified if you run into any correctness issue on vLLM we would love to know and debug them Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across providers and runtimes right now due to implementation differences. Were working with inference providers to make sure gpt-oss performs at its best everywhere and wed love your feedback Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across"
X Link 2025-08-06T03:51Z 31.2K followers, 66.7K engagements

"๐Ÿš€ Amazing community project vLLM CLI a command-line tool for serving LLMs with vLLM: โœ… Interactive menu-driven UI & scripting-friendly CLI โœ… Local + HuggingFace Hub model management โœ… Config profiles for perf/memory tuning โœ… Real-time server & GPU monitoring โœ… Error logs & recovery ๐Ÿ“ฆ Install in one line: pip install vllm-cli GitHub: ๐Ÿ‘‰ Would you like to see these features in vLLM itself Try it out & share feedback https://github.com/Chen-zexi/vllm-cli https://github.com/Chen-zexi/vllm-cli"
X Link 2025-08-17T08:52Z 31.2K followers, 72.8K engagements

"๐Ÿš€ GLM-4.5 meets vLLM @Zai_org 's latest GLM-4.5 & GLM-4.5V models bring hybrid reasoning coding & intelligent agent capabilitiesnow fully supported in vLLM for fast efficient inference on NVIDIA Blackwell & Hopper GPUs Read more ๐Ÿ‘‰ https://blog.vllm.ai/2025/08/19/glm45-vllm.html https://blog.vllm.ai/2025/08/19/glm45-vllm.html"
X Link 2025-08-19T09:10Z 31.2K followers, 16.9K engagements

"๐Ÿš€ Exciting news: DeepSeek-V3.1 from @deepseek_ai now runs on vLLM ๐Ÿง  Seamlessly toggle Think / Non-Think mode per request โšก Powered by vLLMs efficient serving scale to multi-GPU with ease ๐Ÿ›  Perfect for agents tools and fast reasoning workloads ๐Ÿ‘‰ Guide & examples: https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_1.html Introducing DeepSeek-V3.1: our first step toward the agent era ๐Ÿš€ ๐Ÿง  Hybrid inference: Think & Non-Think one model two modes โšก Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs. DeepSeek-R1-0528 ๐Ÿ›  Stronger agent skills: Post-training"
X Link 2025-08-21T17:20Z 31.2K followers, 23.6K engagements

"Wow glad to see vLLM powers @jiawzhao 's DeepConf work impressive results on AIME [----] Do you think this sampling control makes sense Have a try and leave a comment in that PR to let us know https://github.com/vllm-project/vllm/pull/23201 Introducing DeepConf: Deep Think with Confidence ๐Ÿš€ First method to achieve 99.9% on AIME [----] with open-source models Using GPT-OSS-120B even without tools we reached this almost-perfect accuracy while saving up to 85% generated tokens. It also delivers many strong https://t.co/MlhDUKmawH https://github.com/vllm-project/vllm/pull/23201 Introducing DeepConf:"
X Link 2025-08-23T15:31Z 31.3K followers, 48.3K engagements

"๐ŸŒ vLLM Shanghai Meetup Recap ๐Ÿš€ Last weekend we gathered with the community in Shanghai to dive into: Contributing to vLLM Distributed inference ERNIE [---] integration Mooncake + LMCache MetaX hardware support The community is pushing vLLM to new levels of performance scalability & adaptability. ๐Ÿ“‘ Event notes: ๐Ÿ“‘ Slides: https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg"
X Link 2025-08-25T08:59Z 31.2K followers, 12.1K engagements

"๐Ÿš€ LLM Compressor v0.7.0 is here This release brings powerful new features for quantizing large language models including transform support (QuIP SpinQuant) mixed precision compression improved MoE handling with Llama4 support and more. Full blog: More info below ๐Ÿ‘‡ https://developers.redhat.com/articles/2025/08/25/llm-compressor-070-release-recap https://developers.redhat.com/articles/2025/08/25/llm-compressor-070-release-recap"
X Link 2025-08-26T20:02Z 31.2K followers, 16.8K engagements

"๐Ÿš€ vLLM now supports Kwai Keye-VL-1.5 With sharper video ๐Ÿ“น & image ๐Ÿ–ผ comprehension stronger reasoning and an extended 128K context length this model unlocks richer conversations and more complex tasks than ever before. Upgrade to the nightly build and experience it today Check out for more details. #AI #vLLM #KeyeVL https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B"
X Link 2025-09-01T13:36Z 31.2K followers, 10.7K engagements

"vLLM is proud to support the great Kimi update from @Kimi_Moonshot better tool-calling longer context and more Check the deployment guide at ๐Ÿ”ฅ https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/docs/deploy_guidance.md Kimi K2-0905 update ๐Ÿš€ - Enhanced coding capabilities esp. front-end & tool-calling - Context length extended to 256k tokens - Improved integration with various agent scaffolds (e.g. Claude Code Roo Code etc) ๐Ÿ”— Weights & code: https://t.co/83sQekosr9 ๐Ÿ’ฌ Chat with new Kimi https://t.co/mkOuBMwzpw"
X Link 2025-09-05T03:26Z 31.4K followers, 18.6K engagements

"The amazing blogpost from @gordic_aleksa is alive at vLLM's blogpost (after more proofreading and clarifications) Looking forward to future series of tech deep dive blogposts๐Ÿ˜ https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of understanding of the codebase and then to write up https://t.co/F2wsYaFO7q https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html New in-depth blog"
X Link 2025-09-09T01:31Z 31.3K followers, 48.2K engagements

"Wow thanks to @charles_irl you can understand internals of vLLM with a live notebook from @modal ๐Ÿฅฐ I had already planned to spend the day reading @gordic_aleksa's "Inside vLLM" blog post. That turned out to be an incredible fit for @modal Notebooks released today https://t.co/QKX1g9smdp https://t.co/npjf7yrljM I had already planned to spend the day reading @gordic_aleksa's "Inside vLLM" blog post. That turned out to be an incredible fit for @modal Notebooks released today https://t.co/QKX1g9smdp https://t.co/npjf7yrljM"
X Link 2025-09-10T09:27Z 31.2K followers, 30.7K engagements

"โšก Efficient weight updates for RL at trillion-parameter scale ๐Ÿ’ก Best practice from Kimi @Kimi_Moonshot vLLM is proud to collaborate with checkpoint-engine: Broadcast weight sync for 1T params in 20s across 1000s of GPUs Dynamic P2P updates for elastic clusters Optimized pipeline w/ overlapped H2D broadcast & reload Open source & ready for large-scale RL with vLLM ๐Ÿš€ Introducing checkpoint-engine: our open-source lightweight middleware for efficient in-place weight updates in LLM inference engines especially effective for RL. โœ… Update a 1T model on thousands of GPUs in 20s โœ… Supports both"
X Link 2025-09-10T17:06Z 31.2K followers, 24K engagements

"Thank you @cHHillee for the great explanation and demo of how to implement deterministic inference on vLLM https://github.com/vllm-project/vllm/pull/24583 Today Thinking Machines Lab is launching our research blog Connectionism. Our first blog post is Defeating Nondeterminism in LLM Inference We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to https://t.co/jMFL3xt67C https://github.com/vllm-project/vllm/pull/24583 Today Thinking Machines Lab is launching our research blog Connectionism. Our first blog post is"
X Link 2025-09-10T17:41Z 31.2K followers, 16.3K engagements

"Deep dive into optimizing weight transfer step by step and improving it 60x [---] seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT https://t.co/PAUqY43epH [---] seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT https://t.co/PAUqY43epH"
X Link 2025-09-10T20:46Z 31.3K followers, 18.5K engagements

"Welcome Qwen3-Next You can run it efficiently on vLLM with accelerated kernels and native memory management for hybrid models. https://blog.vllm.ai/2025/09/11/qwen3-next.html ๐Ÿš€ Introducing Qwen3-Next-80B-A3B the FUTURE of efficient LLMs is here ๐Ÿ”น 80B params but only 3B activated per token 10x cheaper training 10x faster inference than Qwen3-32B.(esp. @ 32K+ context) ๐Ÿ”นHybridArchitecture:GatedDeltaNet+GatedAttentionbestofspeed& https://t.co/yO7ug721U6 https://blog.vllm.ai/2025/09/11/qwen3-next.html ๐Ÿš€ Introducing Qwen3-Next-80B-A3B the FUTURE of efficient LLMs is here ๐Ÿ”น 80B params but only"
X Link 2025-09-11T19:38Z 31.4K followers, 61.7K engagements

"v0.10.2 marks the first release with official aarch64 support for vLLM You can now install vLLM directly onto @nvidia 's GB200. Along with the PyPI release our docker image is also multi-platform so pulling the right image just works. More perf enhancements on the way"
X Link 2025-09-16T00:49Z 31.3K followers, 17.9K engagements

"Congrats to @deepseek_ai DeepSeek-R1 was published in Nature yesterday as the cover article and vLLM is proud to have supported its RL training and inference๐Ÿฅฐ"
X Link 2025-09-18T02:44Z 31.2K followers, 214.3K engagements

"Pro Tip๐Ÿ’กFast and simple way to deploy DeepSeek-V3.1-Terminus with vLLM โšก Run it with: vllm serve deepseek-ai/DeepSeek-V3.1-Terminus -tp [--] -dcp [--] (as simple as appending -dcp [--] after -tp 8) Thanks to the @Kimi_Moonshot team vLLM 0.10.2 adds Decode Context Parallel (DCP) support: โœ… Cuts KV cache duplication by sharding across GPUs โœ… [--] larger KV cache โœ… [--] throughput gain on single-node H200 Perfect for KV-cache hungry tasks (RL offline data generation). More blogposts diving into DCP are coming soonand optimizations for general GQA models are on the way ๐Ÿš€ #vLLM #DeepSeek #AIInfra ๐Ÿš€"
X Link 2025-09-24T11:35Z 31.3K followers, 44K engagements

"How does @deepseek_ai Sparse Attention (DSA) work It has [--] components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of [---] per token (vs. [---] for MLA). It scores incoming queries. The top-2048 tokens to pass to Sparse MLA. ๐Ÿš€ Introducing DeepSeek-V3.2-Exp our latest experimental model โœจ Built on V3.1-Terminus it debuts DeepSeek Sparse Attention(DSA) for faster more efficient training & inference on long context. ๐Ÿ‘‰ Now live on App Web and API. ๐Ÿ’ฐ API prices cut by 50%+ 1/n ๐Ÿš€ Introducing DeepSeek-V3.2-Exp our latest experimental model โœจ"
X Link 2025-09-29T10:59Z 31.2K followers, 103.3K engagements

"Getting ready to try DeepSeek-V3.2-Exp from @deepseek_ai vLLM is here to help We have verified that it works on H200 machines and many other hardwares thanks to the hardware plugin mechanism. Check out the recipes for more details ๐Ÿ˜ Note: currently the PR is still pending but you can use our pre-compiled wheels to build directly and use the model We will push the model into main branch very soon and add many more optimizations. Stay tuned https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html How does @deepseek_ai Sparse Attention (DSA) work It has [--] components: the"
X Link 2025-09-29T14:05Z 31.2K followers, 12.5K engagements

"Keeping BERT alive in vLLM via transformers Today I can finally announce that the Transformers backend for vLLM will be official way to use encoder-only (think BERT et al.) models in vLLM going forward (1/2) https://t.co/uIlGy2loCE Today I can finally announce that the Transformers backend for vLLM will be official way to use encoder-only (think BERT et al.) models in vLLM going forward (1/2) https://t.co/uIlGy2loCE"
X Link 2025-10-02T17:40Z 31.3K followers, 17.8K engagements

"๐Ÿš€ vLLM x MinerU: Document Parsing at Lightning Speed Were excited to see MinerU fully powered by vLLM bringing ultra-fast accurate and efficient document understanding to everyone. โšก Powered by vLLMs high-throughput inference engine MinerU [---] delivers: Instant parsing no waiting Deeper understanding for complex docs Optimized cost even consumer GPUs can fly Experience the new speed of intelligence: ๐Ÿ‘‰ #vLLM #MinerU #AI #LLM #DocumentParsing #AIresearch https://github.com/opendatalab/MinerU MinerU [---] has arrived with demo on @huggingface https://t.co/KZyL6TJDSe"
X Link 2025-10-11T11:11Z 31.2K followers, 71.7K engagements

"๐Ÿš€ vLLM just hit 60K GitHub stars ๐ŸŽ‰ From a small research idea to powering LLM inference everywhere across NVIDIA AMD Intel Apple TPUs and more vLLM now supports almost all major text-generation models and native RL pipelines like TRL Unsloth Verl and OpenRLHF. Huge thanks to our amazing community and friends at @PyTorch @huggingface Transformers and model vendors from @AIatMeta Llama to @OpenAI GPT-OSS @Alibaba_Qwen Qwen @deepseek_ai DeepSeek and @Kimi_Moonshot Kimi and many others (sorry we ran out of space) for making this ecosystem thrive. โค Let's head to the next chapter of efficient"
X Link 2025-10-13T13:13Z 31.2K followers, 39.1K engagements

"Announcing the completely reimagined vLLM TPU In collaboration with @Google we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. ๐Ÿš€ What's New - JAX + Pytorch: Run PyTorch models on TPUs with no code changes now with native JAX support. - Up to 5x Performance: Achieve nearly 2x-5x higher throughput compared to the first TPU prototype. - Ragged Paged Attention v3: A more flexible and performant attention kernel for TPUs. - SPMD Native: We've shifted to Single Program Multi-Data (SPMD) as the default a"
X Link 2025-10-16T16:08Z 31.2K followers, 157.8K engagements

"kvcached works directly with vLLM and you can use it to serve multiple models on the same GPU. They will share unused KV cache blocks. Check it out ๐Ÿš€ End the GPU Cost Crisis Today Headache with LLMs lock a whole GPU but leave capacity idle Frustrated by your cluster's low utilization We launch kvcached the first library for elastic GPU sharing across LLMs. ๐Ÿ”— https://t.co/3BC7B6s2EX ๐Ÿงต๐Ÿ‘‡ Why it matters: https://t.co/jdIg1gyyOS ๐Ÿš€ End the GPU Cost Crisis Today Headache with LLMs lock a whole GPU but leave capacity idle Frustrated by your cluster's low utilization We launch kvcached the first"
X Link 2025-10-21T23:22Z 31.4K followers, 55.7K engagements

"its tokenization again ๐Ÿคฏ did you know tokenize(detokenize(token_ids)) token_ids RL researchers from Agent Lightning coined the term Retokenization Drift a subtle mismatch between what your model generated and what your trainer thinks it generated. why because most agents call LLMs via OpenAI-compatible APIs that only return strings so when those strings get retokenized later token splits may differ (HAV+ING vs H+AVING) tool-call JSON may be reformatted or chat templates may vary. unstable learning off-policy updates training chaos. ๐Ÿ˜ฌ (@karpathy has a great video explaining all details about"
X Link 2025-10-22T15:17Z 31.2K followers, 171K engagements

"๐Ÿš€ Excited to share our work on batch-invariant inference in vLLM Now you can get identical results regardless of batch size with just one flag: VLLM_BATCH_INVARIANT=1 No more subtle differences between bs=1 and bs=N (including prefill). Let's dive into how we built this ๐Ÿงต๐Ÿ‘‡"
X Link 2025-10-22T20:02Z 31.2K followers, 38.8K engagements
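
As a minimal sketch of the flag in action (the model name below is just an example, not from the post); the same environment variable applies to offline scripts as well as the server:

  # Ask vLLM for batch-invariant (bit-identical) outputs regardless of batch size.
  VLLM_BATCH_INVARIANT=1 vllm serve Qwen/Qwen3-1.7B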

"๐ŸŽ‰@Kimi_Moonshot co-founder @ppwwyyxx talking about Moonshots Decode Context Parallel open source contribution to @vllm_project at @PyTorch conf"
X Link 2025-10-24T00:20Z 31.3K followers, 17.1K engagements

"vLLM ๐Ÿค @nvidia = open scalable agentic AI you can run anywhere. ๐Ÿงต Strengthening our partnership with @nvidia: vLLM serves the NVIDIA Nemotron family. This new blog from NVIDIA walks through how to deploy open high-accuracy agentic inference across data center and edgefast reproducible and production-ready. ๐Ÿ”Ž Model highlight Nemotron Nano [--] (9B): a small language reasoning model with a hybrid TransformerMamba design and a tunable thinking budget. Open weights + 9T tokens of open data on Hugging Face (permissive license). Excels at reasoning/coding instruction following tool calling and"
X Link 2025-10-24T02:50Z 31.2K followers, 19.2K engagements

"Wow OCR models are taking off in vLLM ๐Ÿ˜ Small but powerful ๐Ÿ’ช Enjoy this fast OCR model from @staghado โœŒ You might have seen a lot of OCR release recently. Here is another one introducing ๐Ÿฆ‰ LightOnOCR-1B A fully end-to-end differentiable VLM model competing with all the latest releases while being much faster๐Ÿš€ https://t.co/jSaku5huLb You might have seen a lot of OCR release recently. Here is another one introducing ๐Ÿฆ‰ LightOnOCR-1B A fully end-to-end differentiable VLM model competing with all the latest releases while being much faster๐Ÿš€ https://t.co/jSaku5huLb"
X Link 2025-10-24T04:33Z 31.3K followers, 17.9K engagements

"๐ŸŽ‰ Congrats to the @minimax_ai team for releasing MiniMax-M2 model Built for advanced coding and agentic tasks MiniMax-M2 is now available with Day-0 support on vLLM bringing fast efficient inference and smooth long-context performance. vLLM is proud to power the next generation of deployable intelligence. Check for the latest usage guide #MiniMax #vLLM #AI https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html Were open-sourcing MiniMax M2 Agent & Code Native at 8% Claude Sonnet price 2x faster โšก Global FREE for a limited time via MiniMax Agent & API - Advanced Coding"
X Link 2025-10-27T05:07Z 31.2K followers, 149.2K engagements

"๐Ÿš€ New in vLLM Semantic Router: Parallel LoRA execution Lock-free concurrency with OnceLock Flash Attention [--] for [--] faster inference Rust Go FFI for cloud-native ML Smarter routing safer concurrency faster inference all in one release. ๐Ÿ”— Full blog: #vLLM #LoRA #FlashAttention #AIInfra #SemanticRouter https://blog.vllm.ai/2025/10/27/semantic-router-modular.html https://blog.vllm.ai/2025/10/27/semantic-router-modular.html"
X Link 2025-10-27T14:15Z 31.2K followers, 24.7K engagements

"vLLM Sleep Mode ๐Ÿ˜ด โšกZero-reload model switching for multi-model serving. Benchmarks: [-----] faster switches and 6188% faster first inference vs cold starts. Explanation Blog by @EmbeddedLLM ๐Ÿ‘‡ Why its fast: we keep the process alive preserving the allocator CUDA graphs and JIT-compiled kernels. Cold starts pay these costs every time. Two levels: L1 offloads weights to CPU (fastest wake). L2 discards weights (minimal RAM). Both are dramatically faster than full reloads. Works from multiple GPU TP/PP/EP. Read the full benchmarks + decision guide by @EmbeddedLLM Blog link"
X Link 2025-10-28T07:12Z 31.2K followers, 34.8K engagements

"โœจ At vLLM we strive for correctness reliability and open collaboration every detail matters. Together with @Kimi_Moonshot we verified Kimi K2s tool-calling accuracy on vLLM using the latest K2-Vendor-Verifier benchmark. Our debugging uncovered [--] key compatibility issues all now fixed โœ…: [--] Missing add_generation_prompt [--] Empty content handling [--] Strict tool-call ID parsing ๐Ÿ” After fixes: Kimi K2 models now reach 99.9% request success rate and 76% schema accuracy on vLLM a [---] improvement We also found that some schema errors came from tools called outside the current context a known"
X Link 2025-10-28T10:15Z 31.3K followers, 17K engagements

"Wow excited to see PewDiePie using vLLM to serve language models locally ๐Ÿ˜ƒ vLLM brings easy fast and cheap LLM serving for everyone ๐Ÿฅฐ PewDiePie in 2025: built a [------] rig runs Llama 70B gpt-oss-120B & Qwen 245B locally via vLLM built a custom web UI (chat RAG search TTS) ran protein-folding simulations for charity created an AI council a swarm of [--] models now fine-tuning his own model https://t.co/7otKqRH0D5 PewDiePie in 2025: built a [------] rig runs Llama 70B gpt-oss-120B & Qwen 245B locally via vLLM built a custom web UI (chat RAG search TTS) ran protein-folding simulations for charity"
X Link 2025-11-03T07:02Z 31.2K followers, 166K engagements

"Wow Quantization-enhanced Reinforcement Learning using vLLM Great job by @yukangchen_ ๐Ÿ˜ƒ We open-sourced QeRL Quantization-enhanced Reinforcement Learning ๐Ÿง  4-bit quantized RL training ๐Ÿ’ช Train a 32B LLM on a single H100 GPU โš™ [---] faster overall training ๐ŸŽฏ Accuracy on par with bfloat16-level accuracy ๐Ÿ”ฅ Supports NVFP4 quantization format Moreover we show https://t.co/56GP7H8SEC We open-sourced QeRL Quantization-enhanced Reinforcement Learning ๐Ÿง  4-bit quantized RL training ๐Ÿ’ช Train a 32B LLM on a single H100 GPU โš™ [---] faster overall training ๐ŸŽฏ Accuracy on par with bfloat16-level accuracy"
X Link 2025-11-04T02:40Z 31.2K followers, 23.5K engagements

"Amazing work by @RidgerZhu and the ByteDance Seed team Scaling Latent Reasoning via Looped LMs introduces looped reasoning as a new scaling dimension. ๐Ÿ”ฅ The Ouro model is now runnable on vLLM (nightly version) bringing efficient inference to this new paradigm of latent reasoning. Thrilled to release new paper: Scaling Latent Reasoning via Looped Language Models. TLDR: We scale up loop language models to [---] billion parameters and pretrained on [--] trillion tokens. The resulting model is on par with SOTA language models of [--] to 3x size. https://t.co/6iauhVZ83g Thrilled to release new paper:"
X Link 2025-11-04T13:06Z 31.2K followers, 32.3K engagements

"๐Ÿ˜‰ Kimi-K2 Reasoning is coming very soon just got merged into VLLM LETS FUCKING GOOOO im so hyped im so hyped im so hyped https://t.co/tmZHPpCw3H https://t.co/BESzbBKtl3 Kimi-K2 Reasoning is coming very soon just got merged into VLLM LETS FUCKING GOOOO im so hyped im so hyped im so hyped https://t.co/tmZHPpCw3H https://t.co/BESzbBKtl3"
X Link 2025-11-05T14:11Z 31.3K followers, 27.7K engagements

"๐Ÿš€ Day [--] support: Kimi K2 Thinking now running on vLLM In partnership with @Kimi_Moonshot we're proud to deliver official support for the state-of-the-art open thinking model with 1T params 32B active. Easy deploy in vLLM (nightly version) with OpenAI-compatible API: What makes it special: โšก Native INT4 quantization [--] faster inference ๐Ÿ’พ Half the memory footprint no accuracy loss ๐ŸŽฏ 256K context stable across 200-300 tool calls ๐ŸŽฏ Official recipe & deployment guide included World-class reasoning now accessible to everyone. ๐Ÿ“ฆ Model: ๐Ÿ“š Recipes: #vLLM #KimiK2 #LLMInference"
X Link 2025-11-06T15:29Z 31.2K followers, 34.4K engagements
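
A rough sketch of the OpenAI-compatible deployment the post describes; the Hugging Face repo name and the tensor-parallel size are assumptions for illustration, not details confirmed by the post:

  # Serve the model behind vLLM's OpenAI-compatible server (repo name and TP size assumed).
  vllm serve moonshotai/Kimi-K2-Thinking --tensor-parallel-size 8 --trust-remote-code
  # Then query it with any OpenAI-style client, for example curl:
  curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
    -d '{"model": "moonshotai/Kimi-K2-Thinking", "messages": [{"role": "user", "content": "Hello"}]}'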

"vLLM now optimized for @Intel Arc Pro B-Series GPUs ๐Ÿš€ Delivering accessible high-performance LLM serving with exceptional price-to-performance. Key highlights: 80%+ MoE hardware efficiency with persistent zero-gap kernels 24GB VRAM [---] GB/s bandwidth [---] AI engines Full TP/PP/EP support for multi-GPU scaling DeepSeek distilled models (8B-70B) GPT-OSS Qwen validated MLPerf v5.1: Arc Pro B60 shows strong performance-per-dollar on Llama 8B ๐Ÿ“Š Advanced optimizations: Single persistent kernel loop (eliminates launch overhead) Dynamic load balancing with atomic scheduling Hardware-friendly"
X Link 2025-11-11T15:29Z 31.2K followers, 16.4K engagements

"๐Ÿš€ No More TrainInference Mismatch We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) the first open-source run where training and inference numerics match exactly. It only takes [--] steps: [--] Make vLLM batch-invariant (same seq same output regardless of batching) [--] Ensure forward passes in training use identical kernels as inference [--] Add custom backward passes in PyTorch โœ… Verified on Qwen3 1.7B + GSM8K: batch_inv_ON (bitwise exact) KL=0.0 faster convergence higher reward batch_inv_OFF reduced reward instability We audited every op imported vLLMs fused"
X Link 2025-11-12T16:47Z 31.2K followers, 91.4K engagements

"uv helps vLLM out of dependency hell The simple vLLM nightly index installation has been broken by xformers for a while and people have to use a fairly complicated command to work around. With the new --exclude feature from uv 0.9.8 we can restore the simplicity of installing vLLM uv self update echo xformers exclude.txt uv pip install -U vllm --extra-index-url --excludes exclude.txt @charliermarsh is going to further simplify the command to directly take package names from --excludes. Interested Please help voice out https://wheels.vllm.ai/nightly https://wheels.vllm.ai/nightly"
X Link 2025-11-13T04:51Z 31.2K followers, 57.5K engagements
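
The commands quoted above appear to have lost their line breaks in this capture. A minimal reconstruction, assuming the exclusion flag is spelled as in the post (--excludes) and that the extra index is the nightly wheel URL the post links to:

  uv self update                        # the exclude feature needs uv 0.9.8+
  echo xformers > exclude.txt           # packages the resolver should skip
  uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly --excludes exclude.txt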

"Check out a great framework built upon vLLM to serve Any-to-Any multimodal models Cornserve ๐ŸŒฝ is a distributed serving platform for complex Any-to-Any multimodal models https://t.co/Kr5lrq828i Cornserve ๐ŸŒฝ is a distributed serving platform for complex Any-to-Any multimodal models https://t.co/Kr5lrq828i"
X Link 2025-11-17T05:33Z 31.4K followers, 22.4K engagements

"When we added support for gpt-oss the Responses API didn't have a standard and we essentially reverse-engineered the protocol by iterating and guessing based on the behavior. We are very excited about the Open Responses spec: clean primitives better tooling consistency for the win Today were announcing Open Responses: an open-source spec for building multi-provider interoperable LLM interfaces built on top of the original OpenAI Responses API. โœ… Multi-provider by default โœ… Useful for real-world workflows โœ… Extensible without fragmentation Build https://t.co/SJiBFx1BOF Today were announcing"
X Link 2026-01-16T04:14Z 31K followers, 60.4K engagements

"Day-0 support for GLM-4.7-Flash is now available in vLLM ๐Ÿš€ A new standard for the 30B classefficient lightweight and powerful for coding & agents. Also great for creative writing translation and long-context tasks. PR: https://github.com/vllm-project/vllm/pull/31386 Introducing GLM-4.7-Flash: Your local coding and agentic assistant. Setting a new standard for the 30B class GLM-4.7-Flash balances high performance with efficiency making it the perfect lightweight deployment option. Beyond coding it is also recommended for creative writing https://t.co/gd7hWQathC"
X Link 2026-01-20T01:21Z 31K followers, 32.5K engagements

"Quick vLLM Tip ๐Ÿ’Ž Batch Invariance Deterministic offline inference Same prompt + different batch size can produce different outputs. Fix: VLLM_BATCH_INVARIANT=1 โœ… Docs: https://docs.vllm.ai/en/latest/features/batch_invariance/#offline-inference https://docs.vllm.ai/en/latest/features/batch_invariance/#offline-inference"
X Link 2026-01-20T15:00Z 30.5K followers, 11.3K engagements
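
A minimal sketch of the tip, using the documented VLLM_BATCH_INVARIANT switch: with it enabled, greedy decoding of the same prompt yields the same tokens no matter how it is batched. The model ID is an illustrative assumption.

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"  # set before the engine is created

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-1.7B")  # assumed HF ID, any supported model works
greedy = SamplingParams(temperature=0.0, max_tokens=32)

prompt = "List three uses of KV-cache paging."
alone   = llm.generate([prompt], greedy)[0].outputs[0].token_ids
batched = llm.generate([prompt, "Unrelated filler prompt."], greedy)[0].outputs[0].token_ids
assert list(alone) == list(batched)  # identical output regardless of batch size
```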

"Quick update: from v0.14.0 onward CI ships ROCm Python wheels + Docker images by default. That means you can pip install or pull prebuilt images without extra build steps. Nightly builds are still on the way. Try it here: Hope this helps #vLLM #ROCm #AMD http://vllm.ai/#quick-start http://vllm.ai/#quick-start http://vllm.ai/#quick-start http://vllm.ai/#quick-start"
X Link 2026-01-21T10:56Z 30.5K followers, 17.7K engagements

"๐Ÿ“ Heading to AAAI [----] in Singapore Come meet the vLLM-Omni team at Expo Hall [--] Booth A50 ๐Ÿ—“ Saturday Jan [--] 11:30 AM 12:30 PM We'll be sharing how vLLM-Omni unifies LLM vision and diffusion workloads into a single inference stackand what's coming next. See you there #AAAI2026 #vLLM #vLLMOmni #OpenSource https://twitter.com/i/web/status/2014279556140863793 https://twitter.com/i/web/status/2014279556140863793"
X Link 2026-01-22T10:10Z 30.5K followers, [----] engagements

"Congrats to the @Alibaba_Qwen team on Qwen3-TTS ๐ŸŽ‰ vLLM-Omni is ready with day-0 support voice cloning voice design and natural language control for emotion & prosody all running natively. Offline inference available now via PR #895 online serving coming soon. ๐Ÿ”— https://github.com/vllm-project/vllm-omni/pull/895 Qwen3-TTS is officially live. Weve open-sourced the full familyVoiceDesign CustomVoice and Basebringing high quality to the open community. - [--] models (0.6B & 1.8B) - Free-form voice design & cloning - Support for [--] languages - SOTA 12Hz tokenizer for high compression -"
X Link 2026-01-22T14:45Z 31K followers, 42.3K engagements

"Quick vLLM Tip ๐Ÿ’Ž Auto Max Model Length Running 10M context models like Llama [--] Scout OOM on startup because default context exceeds GPU memory --max-model-len auto (or -1) vLLM auto-fits the max context length to your GPU. Works with hybrid models TP/PP configs. No more manual tuning"
X Link 2026-01-26T15:00Z 30.5K followers, 11.3K engagements
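
A hedged sketch of the same tip from the offline Python entrypoint. The CLI flag is --max-model-len auto; here we assume the Python API accepts the equivalent -1 sentinel mentioned in the post, and the model ID and TP size are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed ID for a 10M-context model
    max_model_len=-1,        # auto-fit the largest context that fits in GPU memory
    tensor_parallel_size=4,  # auto-fit also respects TP/PP sharding
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```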

"๐ŸŽ‰ Congrats @Kimi_Moonshot on Kimi K2.5 a native multimodal agentic model built on 15T vision-language tokens. vLLM day-0 support is ready. If you're building multimodal agents this one's worth evaluating: Vision-native: Pre-trained on 15T vision-language tokens not a bolted-on encoder Code from vision: Feed it UI designs or video workflows get working code Agent Swarm: Breaks complex tasks into parallel sub-agents automatically PR: Serving command: https://github.com/vllm-project/vllm/pull/33131 ๐Ÿฅ Meet Kimi K2.5 Open-Source Visual Agentic Intelligence. ๐Ÿ”น Global SOTA on Agentic Benchmarks:"
X Link 2026-01-27T05:47Z 31K followers, 15.3K engagements

"๐Ÿš€ DeepSeek-OCR [--] introducing Visual Causal Flow from @deepseek_ai learning to read documents the way humans do now running on vLLM โšก with vllm==0.8.5 day-0 support. ๐Ÿง  Replaces fixed raster scanning with learned causal token reordering via DeepEncoder V2. ๐Ÿ“„ [--] visual token compression only [-------] tokens per image. ๐Ÿ“Š 91.09% on OmniDocBench v1.5 (+3.73%) reading order errors cut by 33% repetition down 3040% in production. Model: Github: https://github.com/deepseek-ai/DeepSeek-OCR-2/ https://huggingface.co/deepseek-ai/DeepSeek-OCR-2 https://github.com/deepseek-ai/DeepSeek-OCR-2/"
X Link 2026-01-27T08:27Z 30.7K followers, 14.7K engagements

"๐ŸŽ‰ Congrats to @arcee_ai on releasing Trinity Large with day-0 support in vLLM A 398B sparse MoE with 13B active params trained on 17T+ tokens delivering frontier-level quality with efficient compute. You can serve it now with vLLM ๐Ÿ‘‡ Thanks to the Arcee AI and vLLM community for the collaboration Today were releasing the first weights from Trinity Large our first frontier-scale model in the Trinity MoE family. https://t.co/2zEm4WWMLZ Today were releasing the first weights from Trinity Large our first frontier-scale model in the Trinity MoE family. https://t.co/2zEm4WWMLZ"
X Link 2026-01-28T01:28Z 30.5K followers, 12.4K engagements

"Welcome everyone to the Office Hours ๐Ÿ‘‹ Quick reminder: the Events page on our website has been synced with many SIG meetings including Multimodal Omni AMD torch-compile and CI. Choose the SIG you care about most and get involved. You can also join the discussion in the corresponding #sig-xxx channels on Were excited to have you join the conversation ๐Ÿš€ http://slack.vllm.ai http://vllm.ai/events http://vllm.ai/events vLLM Office Hours this week. Well kick off with our bi-weekly @vllm_project update from core committer @mgoin_ then dive into the vLLM CPU Offloading Connector with the"
X Link 2026-01-28T15:00Z 30.6K followers, [----] engagements

"Nemotron [--] Nano in NVFP4 just dropped from @NVIDIA 4x throughput on B200 (vs FP8-H100) with accuracy preserved via Quantization-Aware Distillation. The checkpoint is already supported by vLLM ๐ŸคThanks NVIDIA vLLM community https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4"
X Link 2026-01-28T17:21Z 31.1K followers, [----] engagements
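
Since the post says the checkpoint is already supported, here is a hedged sketch of loading it by the Hugging Face ID named above; it assumes the NVFP4 quantization format is picked up automatically from the checkpoint config, and the sampling parameters are illustrative.

```python
from vllm import LLM, SamplingParams

# Checkpoint ID taken from the post; quantization is assumed to be auto-detected.
llm = LLM(model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4")
out = llm.generate(["What is NVFP4 quantization?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```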

"Missed Dynamo Day [----] Our session on large-scale LLM serving with vLLM from @simon_mo_ is now available on NVIDIA On-Demand. Covers disaggregated inference Wide-EP for MoE and rack-scale deployments on GB200 NVL72. Thanks @nvidia for hosting Watch recording: #vLLM #NVIDIA https://www.nvidia.com/en-us/on-demand/session/other25-dynamoday06/playlistId=playList-e42aee58-4db9-4ce4-8a6f-c41d8e272d72 https://www.nvidia.com/en-us/on-demand/session/other25-dynamoday06/playlistId=playList-e42aee58-4db9-4ce4-8a6f-c41d8e272d72"
X Link 2026-01-30T03:19Z 30.7K followers, [----] engagements

"๐Ÿš€ vLLM v0.15.0 is here [---] commits from [---] contributors (39 new) Highlights: โšก Async scheduling + Pipeline Parallelism ๐Ÿง  Mamba prefix caching (2x speedup) โšซ Blackwell FP4 65% faster ๐ŸŸฅ AMD RDNA3/RDNA4 consumer GPU support ๐Ÿค– Kimi-K2.5 Molmo2 Eagle2.5-8B VLM EAGLE3 speculative decoding More updates: ๐Ÿ”— https://github.com/vllm-project/vllm/releases/tag/v0.15.0 https://github.com/vllm-project/vllm/releases/tag/v0.15.0"
X Link 2026-01-31T02:14Z 31.1K followers, 38.7K engagements

"๐ŸŽ‰๐ŸŽ‰๐ŸŽ‰ Congrats to @StepFun_ai on releasing Step [---] Flash and day-0 support is ready in vLLM A 196B MoE that activates only 11B params per token giving you frontier reasoning with exceptional efficiency. Highlights: 74.4% SWE-bench Verified 51.0% Terminal-Bench [---] 256K context with 3:1 Sliding Window Attention for cost-efficient long context Built for coding agents and long-horizon agentic tasks Check out our detailed deployment recipe below ๐Ÿ‘‡ https://docs.vllm.ai/projects/recipes/en/latest/StepFun/Step-3.5-Flash.html โšก Step [---] Flash is coming: Fast Enough to Think. Reliable Enough to Act"
X Link 2026-02-02T17:22Z 31K followers, 16.5K engagements

"๐Ÿ”ฅMultimodal models are rapidly gaining traction in production AI and vLLM is often the go-to inference engine for running them at scale. Check out this new @MLCommons @MLPerf benchmark with Qwen3-VL + vLLM on @Shopify product catalog data๐Ÿ‘‡ ๐Ÿš€ NEW: @MLPerf Inference v6.0 debuts Qwen3-VL + @Shopify Product Catalog benchmark 40M products daily. Real production data. First Qwen model in MLPerf. Submit by Feb [--] [----] https://t.co/tWhbcaxdVo #MLPerf #VLM #Shopify #MLCommons https://t.co/001OEnpyLV ๐Ÿš€ NEW: @MLPerf Inference v6.0 debuts Qwen3-VL + @Shopify Product Catalog benchmark 40M products"
X Link 2026-02-02T22:35Z 31K followers, [----] engagements
