@vllm_project (vLLM) posts on X most often about inference, llm, ai, and agentic topics. They currently have [------] followers and [---] posts still getting attention, totaling [-----] engagements in the last [--] hours.
Social category influence: technology brands 28.44%, stocks 22.02%, travel destinations 2.75%, finance 2.75%, countries 1.83%, social networks 0.92%
Social topic influence: inference #56, llm #829, ai 11.01%, agentic 11.01%, "the first" 7.34%, native #781, community 6.42%, strong 6.42%, "up to" 5.5%, performance 5.5%
Top accounts mentioned or mentioned by @kimimoonshot @pytorch @vllmproject @deepseekai @nvidia @alibabaqwen @ai_bridge_japan @nvidiaaidev @huggingface @minimaxai @redhat_ai @aiatmeta @redhatai @mistralai @aiatamd @sparkycollier @anushelangovan @dtransposed @alibaba_qwen @bygregorr
Top posts by engagements in the last [--] hours
"๐ DeepSeek-OCR the new frontier of OCR from @deepseek_ai exploring optical context compression for LLMs is running blazingly fast on vLLM โก (2500 tokens/s on A100-40G) powered by vllm==0.8.5 for day-0 model support. ๐ง Compresses visual contexts up to [--] while keeping 97% OCR accuracy at [--]. ๐ Outperforms GOT-OCR2.0 & MinerU2.0 on OmniDocBench using fewer vision tokens. ๐ค The vLLM team is working with DeepSeek to bring official DeepSeek-OCR support into the next vLLM release making multimodal inference even faster and easier to scale. ๐ #vLLM #DeepSeek #OCR #LLM #VisionAI #DeepLearning"
X Link 2025-10-20T11:31Z 31.6K followers, 1.5M engagements
"Just landed FlashMLA in vLLM and it is already boosting output throughput 2-16% We expect more improvements in the coming days as we continue to optimize the code path. https://github.com/vllm-project/vllm/pull/13747 ๐ Day [--] of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs optimized for variable-length sequences and now in production. โ
BF16 support โ
Paged KV cache (block size 64) โก [----] GB/s memory-bound & [---] TFLOPS https://github.com/vllm-project/vllm/pull/13747 ๐ Day [--] of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our"
X Link 2025-02-27T06:15Z 31.4K followers, 49.5K engagements
"๐ Last week we hosted the vLLM Beijing Meetup with top players like Tencent @TencentHunyuan Huawei @Huawei ByteDance @ByteDanceOSS Ant Group @AntGroup Moonshot AI @Kimi_Moonshot & Xiaomi @XiaoMi_AI ๐ก Discover how industry leaders are using vLLM to power real-world high-performance inference systems why they choose vLLM and how they improve upon vLLM: ๐ WeChat Post ๐บ Livestream Replay ๐ Slides #vLLM #AIInference #LLMInfra https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF https://www.chaspark.com/#/live/1166916873711665152"
X Link 2025-08-07T14:31Z 31.6K followers, [----] engagements
"Have you ever felt you are developing cuda kernels and your tests often run into illegal memory access (IMA for short) and you have no idea how to debug We have collaborated with the @nvidia team to investigate how cuda core dump can help check out the blogpost to learn more https://blog.vllm.ai/2025/08/11/cuda-debugging.html https://blog.vllm.ai/2025/08/11/cuda-debugging.html"
X Link 2025-08-13T03:55Z 31.4K followers, 34K engagements
"Amazing blogpost from @gordic_aleksa explaining internals of vLLM๐ New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of understanding of the codebase and then to write up https://t.co/F2wsYaFO7q New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of"
X Link 2025-09-01T16:06Z 31.4K followers, 32.8K engagements
"๐ New in vLLM: dots.ocr ๐ฅ A powerful multilingual OCR model from @xiaohongshu hi lab is now officially supported in vLLM ๐ Single end-to-end parser for text tables (HTML) formulas (LaTeX) and layouts (Markdown) ๐ Supports [---] languages with robust performance on low-resource docs โก Compact 1.7B VLM but achieves SOTA results on OmniDocBench & dots.ocr-bench โ
Free for commercial use Deploy it in just two steps: uv pip install vllm --extra-index-url vllm serve rednote-hilab/dots.ocr --trust-remote-code Try it today and bring fast accurate OCR to your pipelines. Which models would you like"
X Link 2025-09-28T12:20Z 31.4K followers, 70.6K engagements
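A minimal sketch of the two-step deployment described in the post above, assuming a plain PyPI install (the post passes an --extra-index-url whose value is not preserved in this data, so it is omitted here):
# step 1: install vLLM with uv
uv pip install vllm
# step 2: serve the dots.ocr checkpoint named in the post, allowing its custom model code
vllm serve rednote-hilab/dots.ocr --trust-remote-code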
"๐ The RL community keeps pushing boundaries from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild but thats exactly what PipelineRL makes work. vLLM is proud to power this kind of modular cutting-edge RL innovation. Give it a try and share your thoughts I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences Just update the weights and"
X Link 2025-10-05T07:04Z 31.4K followers, 89K engagements
"๐ Congrats to @Kimi_Moonshot vLLM Day-0 model expands Now supporting Kimi Linear hybrid linear attention with Kimi Delta Attention(KDA): - RULER 128k context: [----] perf + [----] speedup - Up to [--] faster decoding & [---] faster TPOT (1M tokens) - 75% KV cache reduction ๐ก Optimized for high-performance long-context LLM serving Fast Deployment in vLLM๐ Kimi Linear Tech Report is dropped ๐ https://t.co/LwNB2sQnzM Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performanceready to serve as a drop-in replacement for full attention featuring our"
X Link 2025-10-30T16:58Z 31.4K followers, 39.8K engagements
"๐ Congrats @Alibaba_Qwen on the Qwen3-ASR release vLLM has day-0 support. [--] languages 2000x throughput on the 0.6B model singing voice recognition and SOTA accuracy on the 1.7B. Serve it now in vLLM ๐ Learn more: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-ASR.html Qwen3-ASR and Qwen3-ForcedAligner are now open source production-ready speech models designed for messy real-world audio with competitive performance and strong robustness. 52languages&dialectswithautolanguageID(30languages+22dialects/accents) Robustin https://t.co/q7RWjJFXgH"
X Link 2026-01-29T13:25Z 31.3K followers, 49K engagements
"Quick vLLM Tip ๐ Decode Context Parallel Using -tp but KV cache duplicated across GPUs Use -dcp size to shard KV cache along the token dimension. Core principles: [--]. TP shards KV cache along kv-heads (H dimension) [--]. When tp_size H KV cache gets duplicated tp_size/H times [--]. DCP shards along tokens (T dimension) reducing duplication [--]. dcp_size range: [--] tp_size/H no extra GPUs needed [--]. Interleaving strategy: future tokens naturally sharded Trade-off: larger dcp = less duplication more communication. Works with both MLA and GQA models. Docs:"
X Link 2026-01-31T16:25Z 31.6K followers, [----] engagements
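A minimal serving sketch of the flags described in the tip above, reusing the DeepSeek-V3.1-Terminus command quoted later in this list; the parallel sizes are illustrative, not from the post:
# tensor parallelism over 8 GPUs, plus decode context parallelism
# so the KV cache is sharded along tokens instead of duplicated
vllm serve deepseek-ai/DeepSeek-V3.1-Terminus -tp 8 -dcp 8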
"๐ vLLM-Omni v0.14.0 is officially released our first stable release [---] commits from 70+ contributors (23 new) ship the multimodal stack for production. Highlights: โก Async chunk pipeline overlap ๐ฃ Qwen3-TTS with online serving ๐จ Diffusion LoRA (PEFT-compatible) ๐ง DiT layerwise CPU offloading ๐ XPU / ROCm / NPU backends New models: ๐ฅฏ Bagel (multi-stage pipeline) ๐ต Stable Audio Open ๐ผ GLM-Image FLUX.1-dev FLUX.2-klein APIs: ๐ธ /v1/images/edit endpoint ๐ฉบ /health & /v1/models for diffusion mode Performance: โก Torch compile for diffusion ๐ฅ SharedFusedMoE for Qwen3-Omni ๐ง TeaCache for"
X Link 2026-02-01T10:56Z 31.2K followers, 25.4K engagements
"๐ Congrats to @Alibaba_Qwen on releasing Qwen3-Coder-Next and day-0 support is ready in vLLM 0.15.0 An 80B MoE with only 3B active params matching models 1020x larger. Built for coding agents and local development. Verified on NVIDIA GPUs. Recipe below ๐ ๐ IntroducingQwen3-Coder-Next an open-weight LM built for coding agents & local development. Whats new: ๐ค Scaling agentic training:800K verifiable tasks + executable envs ๐ EfficiencyPerformance Tradeoff: achieves strong results on SWE-Bench Pro with 80B total params and https://t.co/P7BmZwdaQ9 ๐ IntroducingQwen3-Coder-Next an"
X Link 2026-02-03T17:44Z 31.4K followers, 28.9K engagements
"๐ vLLM community + @nvidia pushed gpt-oss-120b performance on Blackwell GPUs to new heights: โก +38% max throughput ๐ฏ +13% min latency ๐ Entire Pareto frontier improved Key ingredients: FlashInfer integration torch.compile kernel fusions async scheduling and stream interval optimizations. Deep dive + deployment recipes: Thanks to the teams at @NVIDIAAI @RedHat_AI @AIatMeta and the vLLM community for the collaboration ๐ https://blog.vllm.ai/2026/02/01/gpt-oss-optimizations.html https://blog.vllm.ai/2026/02/01/gpt-oss-optimizations.html"
X Link 2026-02-04T01:29Z 31.4K followers, 26K engagements
"30M downloads for NVIDIA Nemotron on Hugging Face. Huge milestone. ๐ Really appreciate the ongoing @NVIDIAAIDev x vLLM collaboration helping Nemotron run fast and serve developers efficiently at scale. 30M downloads and counting for the NVIDIA Nemotron family on @huggingface ๐ค We're grateful for the incredible community that has made this possible. Get started with Nemotron: https://t.co/4AtDcnOCXS https://t.co/cHRF3qbfVi 30M downloads and counting for the NVIDIA Nemotron family on @huggingface ๐ค We're grateful for the incredible community that has made this possible. Get started with"
X Link 2026-02-04T02:38Z 31.2K followers, [----] engagements
"๐ Congrats to @Alibaba_Qwen on releasing Qwen3.5 on Chinese New Year's Eve day-0 support is ready in vLLM Qwen3.5 is a multimodal MoE with hybrid Mamba-attention architecture 397B total params only 17B active. What makes it interesting for inference: ๐ง Gated Delta Networks + sparse MoE high throughput low latency lower cost ๐ [---] languages and dialects supported out of the box ๐ One model for both text and vision no separate VL pipeline needed Verified on NVIDIA GPUs. Recipes for Docker pip and multi-node deployment ๐ #vLLM #Qwen #OpenSource #Inference"
X Link 2026-02-16T09:47Z 31.6K followers, [----] engagements
"๐ Congrats to @intern_lm on Intern-S1-Pro day-0 support in vLLM ๐ฌ Trillion-scale MoE for scientific reasoning: 1T total params [---] experts 22B activated per token. State-of-the-art across AI4Science domains. PR: Serving command (โ
Verified on NVIDIA GPUs): https://github.com/vllm-project/vllm/pull/33636 ๐Introducing Intern-S1-Pro an advanced 1T MoE open-source multimodal scientific reasoning model. 1SOTA scientific reasoning competitive with leading closed-source models across AI4Science tasks. 2Top-tier performance on advanced reasoning benchmarks strong general https://t.co/cKni28WwQT"
X Link 2026-02-04T13:49Z 31.6K followers, [----] engagements
"๐๐๐ vLLM on NVIDIA GB200: 26.2K prefill TPGS 10.1K decode TPGS for DeepSeek R1/V3. ๐ 3-5x throughput vs H200 - with half the GPUs Key optimizations: - NVFP4 GEMM for MoE experts - FP8 GEMM for MLA - Kernel fusion (RoPE+Quant+Q Write) - Weight offloading v2 with async prefetch Thanks to the @AIatMeta and @NVIDIAAIDev teams for the collaboration ๐ ๐ Blog: https://blog.vllm.ai/2026/02/03/dsr1-gb200.html https://blog.vllm.ai/2026/02/03/dsr1-gb200.html"
X Link 2026-02-04T17:48Z 31.6K followers, 37.3K engagements
"Congrats to @MistralAI on releasing Voxtral Mini 4B Realtime ๐ Day-0 support in vLLM A 4B streaming ASR model achieving 500ms latency while matching offline model accuracy supporting [--] languages. vLLM's new Realtime API /v1/realtime provides audio streaming - optimized for voice assistants live subtitles and meeting transcription Thanks to the close collaboration between the vLLM community and @MistralAI for making this production-grade support possible ๐ค ๐Model & Usage: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602 Voxtral Realtime is built for voice agents and live"
X Link 2026-02-04T17:51Z 31.6K followers, 30.3K engagements
"๐The vLLM-Omni paper is now on arXiv ๐ Congrats to the team on documenting the system design for serving any-to-any multimodal models stage-based decomposition for complex pipelines (LLMs + diffusion + encoders) per-stage batching and flexible GPU allocation. Repo: https://github.com/vllm-project/vllm-omni https://arxiv.org/abs/2602.02204 https://github.com/vllm-project/vllm-omni https://arxiv.org/abs/2602.02204"
X Link 2026-02-05T03:36Z 31.6K followers, 16.5K engagements
"Great writeup from @AI21Labs on scaling vLLM for high-throughput bursty workloads. TL;DR: systematic config tuning + queue-based autoscaling = 2x throughput from the same GPUs. ๐ Useful for anyone running vLLM in production with variable traffic patterns. Thanks to the @AI21Labs team for publishing the full engineering writeup. ๐ ๐ https://www.ai21.com/blog/scaling-vllm-without-oom/ 1/5 Go Big or Go OOM: The Art of Scaling vLLM ๐ฏ. We doubled throughput and cut latency in half-same GPUs just better vLLM config then added smart autoscaling to handle traffic bursts. Here's what we learned"
X Link 2026-02-10T12:17Z 31.6K followers, 14.9K engagements
"Part [--] of our AI21 Labs x vLLM series ๐ After scaling insights this one goes deep into debugging: how a 1-in-1000 gibberish failure in vLLM + Mamba was reproduced traced and fixed upstream. Key lesson: even with correct kernel math request classification timing can still corrupt state under memory pressure. Thanks again to the @AI21Labs team for sharing the full investigation and fix details. ๐ https://www.ai21.com/blog/vllm-debugging-mamba-bug/ https://www.ai21.com/blog/vllm-debugging-mamba-bug/"
X Link 2026-02-10T12:57Z 31.6K followers, [----] engagements
"โกStreaming input + ๐Realtime WebSocket API built in collaboration with @Meta and @MistralAI. First among all popular open-source LLM inference engines. Check out the design and examples in http://blog.vllm.ai/2026/01/31/streaming-realtime.html http://blog.vllm.ai/2026/01/31/streaming-realtime.html"
X Link 2026-02-11T01:27Z 31.6K followers, 11.5K engagements
"๐ vLLM just hit 70K GitHub stars ๐ The engine has kept evolving fast since the last milestone. We've been pushing hard on large-scale serving production-grade multi-node support on NVIDIA Blackwell with WideEP and expert parallelism making it practical to serve the biggest models at scale. More models more hardware async scheduling for higher throughput real-time streaming for speech and audio and a growing multimodal story across text vision video and voice. Huge thanks to our sponsors our 2100+ contributors friends at @PyTorch @huggingface Transformers and the model labs we work closely"
X Link 2026-02-11T08:35Z 31.6K followers, [----] engagements
"๐ฅCongrats to @Zai_org on launching GLM-5 744B parameters (40B active) trained on 28.5T tokens integrating DeepSeek Sparse Attention to keep deployment cost manageable while preserving long-context capacity. vLLM has day-0 support for GLM-5-FP8 with: ๐ DeepSeek Sparse Attention for efficient long-context serving โก MTP speculative decoding โ Tool calling + thinking mode Recipe with serving configs and benchmarks: ๐ https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html"
X Link 2026-02-11T18:44Z 31.6K followers, 41.9K engagements
"Exciting evolution for the Pytorch Foundation family ๐ Congrats to @matthew_d_white on stepping into the CTO role at PyTorch Foundation and welcome @sparkycollier as the new Executive Director As the foundation scales to support @PyTorch @vllm_project @DeepSpeedAI and @raydistributed under one umbrella stronger leadership means stronger open source AI infrastructure for everyone. Looking forward to what's ahead ๐ Over the past two years PyTorch Foundation has evolved into an umbrella foundation hosting PyTorch @vllm_project @DeepSpeedAI & @raydistributed with expanded governance global"
X Link 2026-02-12T12:26Z 31.6K followers, [----] engagements
"๐ DeepSeek R1 on GB300 with vLLM: 22.5K prefill TGS and 3K decode TGS per GPU an 8x prefill and 10-20x mixed-context improvement over Hopper. DeepSeek V3.2 on [--] GPUs (NVFP4 + TP2): 7.4K prefill TGS and 2.8K decode TGS. Key recipe: โก NVFP4 weights from HuggingFace โก FlashInfer FP4 MoE kernel (VLLM_USE_FLASHINFER_MOE_FP4=1) โก [--] GPUs is all you need 288GB HBM per GPU EP vs TP MTP tuning disaggregated prefill all covered. ๐ Collab with @daocloud_io & @verdacloud ๐ Blog: https://blog.vllm.ai/2026/02/13/gb300-deepseek.html https://blog.vllm.ai/2026/02/13/gb300-deepseek.html"
X Link 2026-02-13T13:56Z 31.6K followers, 11.9K engagements
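A minimal sketch of enabling the FlashInfer FP4 MoE kernel mentioned in the post above; the env var is quoted from the post, while the placeholder checkpoint name and -tp 2 setting are assumptions for illustration:
# opt into the FlashInfer FP4 MoE path when serving an NVFP4 DeepSeek checkpoint
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve <nvfp4-deepseek-checkpoint> -tp 2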
"๐ Congrats @MiniMax_AI on M2.5 vLLM has day-0 support SOTA coding (80.2% SWE-Bench Verified) agentic search (76.3% BrowseComp) trained on 200k+ real-world RL environments. 37% faster than M2.1 matching Opus [---] speed. ๐ โ
Verified on NVIDIA GPUs. Recipe (Docker & pip): https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html MiniMax-M2.5 is now open source. Trained with reinforcement learning across hundreds of thousands of complex real-world environments it delivers SOTA performance in coding agentic tool use search and office workflows. Hugging Face: https://t.co/Wxksq9BB7t"
X Link 2026-02-13T14:06Z 31.6K followers, 23.2K engagements
"๐ฅExcited to see SkyRL bringing Tinker to local GPUs. Standardizing training APIs lowers the barrier for research and infrastructure innovation. vLLM is proud to power the inference layer behind high-throughput RL training. ๐ SkyRL now implements the Tinker API. Now training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2 Megatron and vLLM backends. Blog: https://t.co/GAtW81jM38 ๐งต https://t.co/shLZSjdi5x SkyRL now implements the Tinker API. Now training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's"
X Link 2026-02-14T07:38Z 31.6K followers, [----] engagements
"vLLM is coming to Hong Kong join us on March [--] for a full-day meetup on LLM inference multimodal serving and multi-hardware optimization. Talks & workshops from @vllm_project core team @RedHat_AI @AIatAMD MetaX and @MiniMax_AI and more. ๐ Register: https://www.vantagemind.com/events/vLLM/260307/vLLM-HK-Meetup_vLLM.html https://www.vantagemind.com/events/vLLM/260307/vLLM-HK-Meetup_vLLM.html"
X Link 2026-02-14T09:15Z 31.6K followers, [----] engagements
"๐ Congrats to @Alibaba_Qwen on releasing Qwen3.5 on Chinese New Year's Eve day-0 support is ready in vLLM Qwen3.5 is a multimodal MoE with Gated Delta Networks architecture 397B total params only 17B active. What makes it interesting for inference: ๐ง Gated Delta Networks + sparse MoE high throughput low latency lower cost ๐ [---] languages and dialects supported out of the box ๐ One model for both text and vision no separate VL pipeline needed Verified on NVIDIA GPUs. Recipes for Docker pip and multi-node deployment ๐ #vLLM #Qwen #OpenSource #Inference"
X Link 2026-02-16T10:17Z 31.6K followers, 45.4K engagements
"Join us on March [--] for an evening of deep technical talks on inference IDE integrations omni-modality and Kubernetes-scale serving. ๐
March [--] 5:00 PM - 9:00 PM (GMT+1) Warsaw Poland See you in Warsaw ๐ Warsaw friends vLLM Inference Meetup is happening March [--]. @jetbrains @RedHat and @nvidia are bringing @vllm_project maintainers + real demos plus an optional hands-on workshop earlier in the day. If you build or run inference come meet the people shipping it. ๐ Warsaw friends vLLM Inference Meetup is happening March [--]. @jetbrains @RedHat and @nvidia are bringing @vllm_project"
X Link 2026-02-18T02:20Z 31.6K followers, [----] engagements
"Great work We love how @vllm_project is used in the rollout process with with offloading the engine to CPU and give the GPU back to the kernel to be benchmarked This is a small feature we implemented to make RLHF smoother with vLLM. Our research interns present: Kevin-32B = K(ernel D)evin It's the first open model trained using RL for writing CUDA kernels. We implemented multi-turn RL using GRPO (based on QwQ-32B) on the KernelBench dataset. It outperforms top reasoning models (o3 & o4-mini) ๐งต https://t.co/I3UXLGKFNb Our research interns present: Kevin-32B = K(ernel D)evin It's the first"
X Link 2025-05-07T01:45Z 31.3K followers, 12.9K engagements
""inference uses @vllm_project" Releasing INTELLECT-2: Were open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: Detailed Technical Report INTELLECT-2 model checkpoint https://t.co/iHDDHRyKN2 Releasing INTELLECT-2: Were open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: Detailed Technical Report INTELLECT-2 model checkpoint https://t.co/iHDDHRyKN2"
X Link 2025-05-12T05:39Z 31.2K followers, 10.7K engagements
"uv pip install -U vLLM The latest release features [---] commits from [---] contributors. vLLM is now ready for @nvidia Blackwell with the latest @PyTorch [---] upgrade. Huge thanks to @NVIDIAAIDev and @ye_combinator for the CUTLASS and FlashInfer kernels"
X Link 2025-05-28T01:54Z 31.2K followers, 17.7K engagements
"Congrats on the launch vLLM is proud to support the new Qwen3 embedding models check it out ๐๐ป https://github.com/QwenLM/Qwen3-Embeddingtab=readme-ov-file#vllm-usage ๐ Proud to introduce the Qwen3-Embedding and Qwen3-Reranker Series setting new standards in multilingual text embedding and relevance ranking โจ Highlights: โ
Available in 0.6B / 4B / 8B versions โ
Supports [---] languages โ
State-of-the-Art performance on MMTEB MTEB https://t.co/qNu0rswSol https://github.com/QwenLM/Qwen3-Embeddingtab=readme-ov-file#vllm-usage ๐ Proud to introduce the Qwen3-Embedding and Qwen3-Reranker Series"
X Link 2025-06-05T15:45Z 31.3K followers, 12.2K engagements
"uv pip install -U vllm --extra-index-url --torch-backend=auto Try out Magistral on with vLLM 0.9.1rc1 today ๐ฎ https://wheels.vllm.ai/0.9.1rc1 Announcing Magistral our first reasoning model designed to excel in domain-specific transparent and multilingual reasoning. https://t.co/SwKEEtCIXh https://wheels.vllm.ai/0.9.1rc1 Announcing Magistral our first reasoning model designed to excel in domain-specific transparent and multilingual reasoning. https://t.co/SwKEEtCIXh"
X Link 2025-06-10T14:37Z 31.2K followers, 11.7K engagements
"๐ Look what just arrived at @BerkeleySky ๐ A shiny MI355X system. Huge thanks to @AMD for supporting open source and we are looking forward to getting it set up in the next few days"
X Link 2025-06-10T17:20Z 31.3K followers, 20.7K engagements
"Thank you @AMD @LisaSu @AnushElangovan for Advancing AI together with @vllm_project We look forward to the continued partnership and pushing the boundary of inference"
X Link 2025-06-13T23:54Z 31.2K followers, 17.3K engagements
"Congrats on the launch vLLM is proud to support this great model on day [--] looking forward to the following releases Day 1/5 of #MiniMaxWeek: Were open-sourcing MiniMax-M1 our latest LLM setting new standards in long-context reasoning. - Worlds longest context window: 1M-token input 80k-token output - State-of-the-art agentic use among open-source models - RL at unmatched efficiency: https://t.co/bGfDlZA54n Day 1/5 of #MiniMaxWeek: Were open-sourcing MiniMax-M1 our latest LLM setting new standards in long-context reasoning. - Worlds longest context window: 1M-token input 80k-token output -"
X Link 2025-06-16T16:14Z 31.3K followers, 19K engagements
"vLLM has just reached 50K github stars Huge thanks to the community๐ Together let's bring easy fast and cheap LLM serving for everyoneโ๐ป"
X Link 2025-06-19T05:25Z 31.2K followers, 18.3K engagements
""The second most intelligent open weights model after DeepSeek R1 with a much longer 1M token context window" Checkout the blog post from @minimax_ai on how the model is implemented on vLLM and how you can run this model efficiently https://blog.vllm.ai/2025/06/30/minimax-m1.html MiniMax launches their first reasoning model: MiniMax M1 the second most intelligent open weights model after DeepSeek R1 with a much longer 1M token context window @minimax_ai M1 is based on their Text-01 model (released [--] Jan 2025) - an MoE with 456B total and 45.9B active https://t.co/JltMYrm0te"
X Link 2025-07-01T04:16Z 31.2K followers, 13.9K engagements
"We genuinely want to solve this problem. As many (@Rxday000 @samsja19 @danielhanchen @_EldarKurtic and more) chimed in the reason includes attention kernels matmul reduction order precisions in various operators and more horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh horrifying bug of the day is finding out the vllm and huggingface produce significantly different logprobs https://t.co/PWgVCMCcgh"
X Link 2025-07-06T21:30Z 31.2K followers, 37.3K engagements
"vLLM runs on free-threaded Python A group of engineers from @Metas Python runtime language team has shown that its possible to run vLLM on the nogil distribution of Python. Were incredibly excited to embrace this future technique and be early adopters ๐"
X Link 2025-07-08T05:06Z 31.2K followers, 50.6K engagements
"@Kimi_Moonshot just released a trillion-parameter model with great agentic capability and it is already supported in vLLM Have a try with a simple command and check the doc for more advanced deployment๐ ๐ Hello Kimi K2 Open-Source Agentic Model ๐น 1T total / 32B active MoE model ๐น SOTA on SWE Bench Verified Tau2 & AceBench among open models ๐นStrong in coding and agentic tasks ๐ค Multimodal & thought-mode not supported for now With Kimi K2 advanced agentic intelligence https://t.co/PlRQNrg9JL ๐ Hello Kimi K2 Open-Source Agentic Model ๐น 1T total / 32B active MoE model ๐น SOTA on SWE Bench"
X Link 2025-07-11T15:06Z 31.4K followers, 34.4K engagements
"Thanks for the great write-up ๐ Prefix caching is critical for agentic workflows like @manusai and vLLM makes it seamless. โ
prefix caching is enabled by default with an efficient implementation โ
Append-only context Cache hit heaven Context engineering FTW ๐ After four overhauls and millions of real-world sessions here are the lessons we learned about context engineering for AI agents: https://t.co/Ql014rEzBQ After four overhauls and millions of real-world sessions here are the lessons we learned about context engineering for AI agents: https://t.co/Ql014rEzBQ"
X Link 2025-07-19T14:20Z 31.2K followers, 15K engagements
"The @huggingface Transformers @vllm_project integration just leveled up: Vision-Language Models are now supported out of the box If the model is integrated into Transformers you can now run it directly with vLLM. Great work @RTurganbay ๐ https://github.com/vllm-project/vllm/pull/20543 https://github.com/vllm-project/vllm/pull/20543"
X Link 2025-07-22T20:32Z 31.2K followers, 22.3K engagements
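A minimal sketch of what "run it directly with vLLM" can look like, assuming a vision-language model whose implementation lives in Transformers; the --model-impl selector and the example model ID are assumptions for illustration, not quoted from the post:
# ask vLLM to use the Transformers modeling backend instead of a native implementation
vllm serve Qwen/Qwen2-VL-2B-Instruct --model-impl transformers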
"โ
Try out @Alibaba_Qwen [--] Coder on vLLM nightly with "qwen3_coder" tool call parser Additionally vLLM offers expert parallelism so you can run this model in flexible configurations where it fits. Qwen3-Coder is here โ
Were releasing Qwen3-Coder-480B-A35B-Instruct our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves https://t.co/Z8HfyrVScE Qwen3-Coder is here โ
Were releasing Qwen3-Coder-480B-A35B-Instruct our most powerful open agentic code model to"
X Link 2025-07-22T22:06Z 31.2K followers, 34.2K engagements
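A minimal serving sketch for the tool call parser named in the post above; the auto-tool-choice flag and tensor-parallel size are assumptions about a typical vLLM setup, not quoted from the post:
# enable automatic tool choice and the qwen3_coder parser mentioned above
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --tensor-parallel-size 8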
"This amazing Attention-FFN disaggregation implementation from @StepFun_ai achieves decoding throughput of up to [----] tokens per second per GPU under 50ms TPOT SLA for their 321B-A38B MoE model Step3 served with H800 The implementation is based on vLLM and we are working together to bring it to the public Kudos to @StepFun_ai ๐ Check out their tech report at . https://github.com/stepfun-ai/Step3 https://github.com/stepfun-ai/Step3"
X Link 2025-07-26T01:26Z 31.2K followers, 33.6K engagements
"๐ฅ vLLM @ PyTorch Conference [----] ๐ฅ Were excited to share that [--] talks at this years PyTorch Conference will feature vLLM Topics include: Easy & Fast LLM Serving Open-Source Post-Training Stack Scaling Online LLM Training AMD GPU support via Triton vllm-triton backend performance Stay tuned & come say hi ๐ #vLLM #PyTorch #LLM #AI #opensource ๐ ICYMI: The #PyTorchConf schedule is now live https://t.co/YSAdVaiWRk Were talking [--] days of cutting-edge talks on #LLM scaling real-time inference model optimization & more straight from the #PyTorch community. ๐ Oct [----] San Francisco ๐ Register:"
X Link 2025-07-31T07:31Z 31.2K followers, 29.1K engagements
"Thank you @OpenAI for open-sourcing these great models ๐ Were proud to be the official launch partner for gpt-oss (20B & 120B) now supported in vLLM ๐ โก MXFP4 quant = fast & efficient ๐ Hybrid attention (sliding + full) ๐ค Strong agentic abilities ๐ Easy deployment ๐๐ป Check out the blog and recipes for more details ๐ฅ #vLLM #gptOSS #openai https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html https://blog.vllm.ai/2025/08/05/gpt-oss.html Our open models are here. Both of them. https://t.co/9tFxefOXcg https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html"
X Link 2025-08-05T17:31Z 31.2K followers, 44.3K engagements
"๐ we care a lot about correctness ran many evals and stared at many tensors to compare them. numerics of vLLM on hopper should be solid and verified if you run into any correctness issue on vLLM we would love to know and debug them Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across providers and runtimes right now due to implementation differences. Were working with inference providers to make sure gpt-oss performs at its best everywhere and wed love your feedback Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across"
X Link 2025-08-06T03:51Z 31.2K followers, 66.7K engagements
"๐ Amazing community project vLLM CLI a command-line tool for serving LLMs with vLLM: โ
Interactive menu-driven UI & scripting-friendly CLI โ
Local + HuggingFace Hub model management โ
Config profiles for perf/memory tuning โ
Real-time server & GPU monitoring โ
Error logs & recovery ๐ฆ Install in one line: pip install vllm-cli GitHub: ๐ Would you like to see these features in vLLM itself Try it out & share feedback https://github.com/Chen-zexi/vllm-cli https://github.com/Chen-zexi/vllm-cli"
X Link 2025-08-17T08:52Z 31.2K followers, 72.8K engagements
"๐ GLM-4.5 meets vLLM @Zai_org 's latest GLM-4.5 & GLM-4.5V models bring hybrid reasoning coding & intelligent agent capabilitiesnow fully supported in vLLM for fast efficient inference on NVIDIA Blackwell & Hopper GPUs Read more ๐ https://blog.vllm.ai/2025/08/19/glm45-vllm.html https://blog.vllm.ai/2025/08/19/glm45-vllm.html"
X Link 2025-08-19T09:10Z 31.2K followers, 16.9K engagements
"๐ Exciting news: DeepSeek-V3.1 from @deepseek_ai now runs on vLLM ๐ง Seamlessly toggle Think / Non-Think mode per request โก Powered by vLLMs efficient serving scale to multi-GPU with ease ๐ Perfect for agents tools and fast reasoning workloads ๐ Guide & examples: https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_1.html Introducing DeepSeek-V3.1: our first step toward the agent era ๐ ๐ง Hybrid inference: Think & Non-Think one model two modes โก Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs. DeepSeek-R1-0528 ๐ Stronger agent skills: Post-training"
X Link 2025-08-21T17:20Z 31.2K followers, 23.6K engagements
"Wow glad to see vLLM powers @jiawzhao 's DeepConf work impressive results on AIME [----] Do you think this sampling control makes sense Have a try and leave a comment in that PR to let us know https://github.com/vllm-project/vllm/pull/23201 Introducing DeepConf: Deep Think with Confidence ๐ First method to achieve 99.9% on AIME [----] with open-source models Using GPT-OSS-120B even without tools we reached this almost-perfect accuracy while saving up to 85% generated tokens. It also delivers many strong https://t.co/MlhDUKmawH https://github.com/vllm-project/vllm/pull/23201 Introducing DeepConf:"
X Link 2025-08-23T15:31Z 31.3K followers, 48.3K engagements
"๐ vLLM Shanghai Meetup Recap ๐ Last weekend we gathered with the community in Shanghai to dive into: Contributing to vLLM Distributed inference ERNIE [---] integration Mooncake + LMCache MetaX hardware support The community is pushing vLLM to new levels of performance scalability & adaptability. ๐ Event notes: ๐ Slides: https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg"
X Link 2025-08-25T08:59Z 31.2K followers, 12.1K engagements
"๐ LLM Compressor v0.7.0 is here This release brings powerful new features for quantizing large language models including transform support (QuIP SpinQuant) mixed precision compression improved MoE handling with Llama4 support and more. Full blog: More info below ๐ https://developers.redhat.com/articles/2025/08/25/llm-compressor-070-release-recap https://developers.redhat.com/articles/2025/08/25/llm-compressor-070-release-recap"
X Link 2025-08-26T20:02Z 31.2K followers, 16.8K engagements
"๐ vLLM now supports Kwai Keye-VL-1.5 With sharper video ๐น & image ๐ผ comprehension stronger reasoning and an extended 128K context length this model unlocks richer conversations and more complex tasks than ever before. Upgrade to the nightly build and experience it today Check out for more details. #AI #vLLM #KeyeVL https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B"
X Link 2025-09-01T13:36Z 31.2K followers, 10.7K engagements
"vLLM is proud to support the great Kimi update from @Kimi_Moonshot better tool-calling longer context and more Check the deployment guide at ๐ฅ https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/docs/deploy_guidance.md Kimi K2-0905 update ๐ - Enhanced coding capabilities esp. front-end & tool-calling - Context length extended to 256k tokens - Improved integration with various agent scaffolds (e.g. Claude Code Roo Code etc) ๐ Weights & code: https://t.co/83sQekosr9 ๐ฌ Chat with new Kimi https://t.co/mkOuBMwzpw"
X Link 2025-09-05T03:26Z 31.4K followers, 18.6K engagements
"The amazing blogpost from @gordic_aleksa is alive at vLLM's blogpost (after more proofreading and clarifications) Looking forward to future series of tech deep dive blogposts๐ https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work Took me a while to get this level of understanding of the codebase and then to write up https://t.co/F2wsYaFO7q https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html New in-depth blog"
X Link 2025-09-09T01:31Z 31.3K followers, 48.2K engagements
"Wow thanks to @charles_irl you can understand internals of vLLM with a live notebook from @modal ๐ฅฐ I had already planned to spend the day reading @gordic_aleksa's "Inside vLLM" blog post. That turned out to be an incredible fit for @modal Notebooks released today https://t.co/QKX1g9smdp https://t.co/npjf7yrljM I had already planned to spend the day reading @gordic_aleksa's "Inside vLLM" blog post. That turned out to be an incredible fit for @modal Notebooks released today https://t.co/QKX1g9smdp https://t.co/npjf7yrljM"
X Link 2025-09-10T09:27Z 31.2K followers, 30.7K engagements
"โก Efficient weight updates for RL at trillion-parameter scale ๐ก Best practice from Kimi @Kimi_Moonshot vLLM is proud to collaborate with checkpoint-engine: Broadcast weight sync for 1T params in 20s across 1000s of GPUs Dynamic P2P updates for elastic clusters Optimized pipeline w/ overlapped H2D broadcast & reload Open source & ready for large-scale RL with vLLM ๐ Introducing checkpoint-engine: our open-source lightweight middleware for efficient in-place weight updates in LLM inference engines especially effective for RL. โ
Update a 1T model on thousands of GPUs in 20s โ
Supports both"
X Link 2025-09-10T17:06Z 31.2K followers, 24K engagements
"Thank you @cHHillee for the great explanation and demo of how to implement deterministic inference on vLLM https://github.com/vllm-project/vllm/pull/24583 Today Thinking Machines Lab is launching our research blog Connectionism. Our first blog post is Defeating Nondeterminism in LLM Inference We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to https://t.co/jMFL3xt67C https://github.com/vllm-project/vllm/pull/24583 Today Thinking Machines Lab is launching our research blog Connectionism. Our first blog post is"
X Link 2025-09-10T17:41Z 31.2K followers, 16.3K engagements
"Deep dive into optimizing weight transfer step by step and improving it 60x [---] seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT https://t.co/PAUqY43epH [---] seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT https://t.co/PAUqY43epH"
X Link 2025-09-10T20:46Z 31.3K followers, 18.5K engagements
"Welcome Qwen3-Next You can run it efficiently on vLLM with accelerated kernels and native memory management for hybrid models. https://blog.vllm.ai/2025/09/11/qwen3-next.html ๐ Introducing Qwen3-Next-80B-A3B the FUTURE of efficient LLMs is here ๐น 80B params but only 3B activated per token 10x cheaper training 10x faster inference than Qwen3-32B.(esp. @ 32K+ context) ๐นHybridArchitecture:GatedDeltaNet+GatedAttentionbestofspeed& https://t.co/yO7ug721U6 https://blog.vllm.ai/2025/09/11/qwen3-next.html ๐ Introducing Qwen3-Next-80B-A3B the FUTURE of efficient LLMs is here ๐น 80B params but only"
X Link 2025-09-11T19:38Z 31.4K followers, 61.7K engagements
"v0.10.2 marks the first release with official aarch64 support for vLLM You can now install vLLM directly onto @nvidia 's GB200. Along with the PyPI release our docker image is also multi-platform so pulling the right image just works. More perf enhancements on the way"
X Link 2025-09-16T00:49Z 31.3K followers, 17.9K engagements
"Congrats to @deepseek_ai DeepSeek-R1 was published in Nature yesterday as the cover article and vLLM is proud to have supported its RL training and inference๐ฅฐ"
X Link 2025-09-18T02:44Z 31.2K followers, 214.3K engagements
"Pro Tip๐กFast and simple way to deploy DeepSeek-V3.1-Terminus with vLLM โก Run it with: vllm serve deepseek-ai/DeepSeek-V3.1-Terminus -tp [--] -dcp [--] (as simple as appending -dcp [--] after -tp 8) Thanks to the @Kimi_Moonshot team vLLM 0.10.2 adds Decode Context Parallel (DCP) support: โ
Cuts KV cache duplication by sharding across GPUs โ
[--] larger KV cache โ
[--] throughput gain on single-node H200 Perfect for KV-cache hungry tasks (RL offline data generation). More blogposts diving into DCP are coming soonand optimizations for general GQA models are on the way ๐ #vLLM #DeepSeek #AIInfra ๐"
X Link 2025-09-24T11:35Z 31.3K followers, 44K engagements
"How does @deepseek_ai Sparse Attention (DSA) work It has [--] components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of [---] per token (vs. [---] for MLA). It scores incoming queries. The top-2048 tokens to pass to Sparse MLA. ๐ Introducing DeepSeek-V3.2-Exp our latest experimental model โจ Built on V3.1-Terminus it debuts DeepSeek Sparse Attention(DSA) for faster more efficient training & inference on long context. ๐ Now live on App Web and API. ๐ฐ API prices cut by 50%+ 1/n ๐ Introducing DeepSeek-V3.2-Exp our latest experimental model โจ"
X Link 2025-09-29T10:59Z 31.2K followers, 103.3K engagements
"Getting ready to try DeepSeek-V3.2-Exp from @deepseek_ai vLLM is here to help We have verified that it works on H200 machines and many other hardwares thanks to the hardware plugin mechanism. Check out the recipes for more details ๐ Note: currently the PR is still pending but you can use our pre-compiled wheels to build directly and use the model We will push the model into main branch very soon and add many more optimizations. Stay tuned https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html How does @deepseek_ai Sparse Attention (DSA) work It has [--] components: the"
X Link 2025-09-29T14:05Z 31.2K followers, 12.5K engagements
"Keeping BERT alive in vLLM via transformers Today I can finally announce that the Transformers backend for vLLM will be official way to use encoder-only (think BERT et al.) models in vLLM going forward (1/2) https://t.co/uIlGy2loCE Today I can finally announce that the Transformers backend for vLLM will be official way to use encoder-only (think BERT et al.) models in vLLM going forward (1/2) https://t.co/uIlGy2loCE"
X Link 2025-10-02T17:40Z 31.3K followers, 17.8K engagements
"๐ vLLM x MinerU: Document Parsing at Lightning Speed Were excited to see MinerU fully powered by vLLM bringing ultra-fast accurate and efficient document understanding to everyone. โก Powered by vLLMs high-throughput inference engine MinerU [---] delivers: Instant parsing no waiting Deeper understanding for complex docs Optimized cost even consumer GPUs can fly Experience the new speed of intelligence: ๐ #vLLM #MinerU #AI #LLM #DocumentParsing #AIresearch https://github.com/opendatalab/MinerU MinerU [---] has arrived with demo on @huggingface https://t.co/KZyL6TJDSe"
X Link 2025-10-11T11:11Z 31.2K followers, 71.7K engagements
"๐ vLLM just hit 60K GitHub stars ๐ From a small research idea to powering LLM inference everywhere across NVIDIA AMD Intel Apple TPUs and more vLLM now supports almost all major text-generation models and native RL pipelines like TRL Unsloth Verl and OpenRLHF. Huge thanks to our amazing community and friends at @PyTorch @huggingface Transformers and model vendors from @AIatMeta Llama to @OpenAI GPT-OSS @Alibaba_Qwen Qwen @deepseek_ai DeepSeek and @Kimi_Moonshot Kimi and many others (sorry we ran out of space) for making this ecosystem thrive. โค Let's head to the next chapter of efficient"
X Link 2025-10-13T13:13Z 31.2K followers, 39.1K engagements
"Announcing the completely reimagined vLLM TPU In collaboration with @Google we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. ๐ What's New - JAX + Pytorch: Run PyTorch models on TPUs with no code changes now with native JAX support. - Up to 5x Performance: Achieve nearly 2x-5x higher throughput compared to the first TPU prototype. - Ragged Paged Attention v3: A more flexible and performant attention kernel for TPUs. - SPMD Native: We've shifted to Single Program Multi-Data (SPMD) as the default a"
X Link 2025-10-16T16:08Z 31.2K followers, 157.8K engagements
"kvcached works directly with vLLM and you can use it to serve multiple models on the same GPU. They will share unused KV cache blocks. Check it out ๐ End the GPU Cost Crisis Today Headache with LLMs lock a whole GPU but leave capacity idle Frustrated by your cluster's low utilization We launch kvcached the first library for elastic GPU sharing across LLMs. ๐ https://t.co/3BC7B6s2EX ๐งต๐ Why it matters: https://t.co/jdIg1gyyOS ๐ End the GPU Cost Crisis Today Headache with LLMs lock a whole GPU but leave capacity idle Frustrated by your cluster's low utilization We launch kvcached the first"
X Link 2025-10-21T23:22Z 31.4K followers, 55.7K engagements
"its tokenization again ๐คฏ did you know tokenize(detokenize(token_ids)) token_ids RL researchers from Agent Lightning coined the term Retokenization Drift a subtle mismatch between what your model generated and what your trainer thinks it generated. why because most agents call LLMs via OpenAI-compatible APIs that only return strings so when those strings get retokenized later token splits may differ (HAV+ING vs H+AVING) tool-call JSON may be reformatted or chat templates may vary. unstable learning off-policy updates training chaos. ๐ฌ (@karpathy has a great video explaining all details about"
X Link 2025-10-22T15:17Z 31.2K followers, 171K engagements
"๐ Excited to share our work on batch-invariant inference in vLLM Now you can get identical results regardless of batch size with just one flag: VLLM_BATCH_INVARIANT=1 No more subtle differences between bs=1 and bs=N (including prefill). Let's dive into how we built this ๐งต๐"
X Link 2025-10-22T20:02Z 31.2K followers, 38.8K engagements
"๐@Kimi_Moonshot co-founder @ppwwyyxx talking about Moonshots Decode Context Parallel open source contribution to @vllm_project at @PyTorch conf"
X Link 2025-10-24T00:20Z 31.3K followers, 17.1K engagements
"vLLM ๐ค @nvidia = open scalable agentic AI you can run anywhere. ๐งต Strengthening our partnership with @nvidia: vLLM serves the NVIDIA Nemotron family. This new blog from NVIDIA walks through how to deploy open high-accuracy agentic inference across data center and edgefast reproducible and production-ready. ๐ Model highlight Nemotron Nano [--] (9B): a small language reasoning model with a hybrid TransformerMamba design and a tunable thinking budget. Open weights + 9T tokens of open data on Hugging Face (permissive license). Excels at reasoning/coding instruction following tool calling and"
X Link 2025-10-24T02:50Z 31.2K followers, 19.2K engagements
"Wow OCR models are taking off in vLLM ๐ Small but powerful ๐ช Enjoy this fast OCR model from @staghado โ You might have seen a lot of OCR release recently. Here is another one introducing ๐ฆ LightOnOCR-1B A fully end-to-end differentiable VLM model competing with all the latest releases while being much faster๐ https://t.co/jSaku5huLb You might have seen a lot of OCR release recently. Here is another one introducing ๐ฆ LightOnOCR-1B A fully end-to-end differentiable VLM model competing with all the latest releases while being much faster๐ https://t.co/jSaku5huLb"
X Link 2025-10-24T04:33Z 31.3K followers, 17.9K engagements
"๐ Congrats to the @minimax_ai team for releasing MiniMax-M2 model Built for advanced coding and agentic tasks MiniMax-M2 is now available with Day-0 support on vLLM bringing fast efficient inference and smooth long-context performance. vLLM is proud to power the next generation of deployable intelligence. Check for the latest usage guide #MiniMax #vLLM #AI https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html Were open-sourcing MiniMax M2 Agent & Code Native at 8% Claude Sonnet price 2x faster โก Global FREE for a limited time via MiniMax Agent & API - Advanced Coding"
X Link 2025-10-27T05:07Z 31.2K followers, 149.2K engagements
"๐ New in vLLM Semantic Router: Parallel LoRA execution Lock-free concurrency with OnceLock Flash Attention [--] for [--] faster inference Rust Go FFI for cloud-native ML Smarter routing safer concurrency faster inference all in one release. ๐ Full blog: #vLLM #LoRA #FlashAttention #AIInfra #SemanticRouter https://blog.vllm.ai/2025/10/27/semantic-router-modular.html https://blog.vllm.ai/2025/10/27/semantic-router-modular.html"
X Link 2025-10-27T14:15Z 31.2K followers, 24.7K engagements
"vLLM Sleep Mode ๐ด โกZero-reload model switching for multi-model serving. Benchmarks: [-----] faster switches and 6188% faster first inference vs cold starts. Explanation Blog by @EmbeddedLLM ๐ Why its fast: we keep the process alive preserving the allocator CUDA graphs and JIT-compiled kernels. Cold starts pay these costs every time. Two levels: L1 offloads weights to CPU (fastest wake). L2 discards weights (minimal RAM). Both are dramatically faster than full reloads. Works from multiple GPU TP/PP/EP. Read the full benchmarks + decision guide by @EmbeddedLLM Blog link"
X Link 2025-10-28T07:12Z 31.2K followers, 34.8K engagements
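A rough sketch of toggling the two sleep levels described above, assuming vLLM's sleep-mode engine flag and the development-mode HTTP endpoints; treat the exact flag, env var and endpoint names as assumptions if your vLLM version differs:
# start a server with sleep mode enabled and dev endpoints exposed (model name illustrative)
VLLM_SERVER_DEV_MODE=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-sleep-mode
# L1: offload weights to CPU RAM (fastest wake-up)
curl -X POST 'http://localhost:8000/sleep?level=1'
# L2: discard weights entirely (minimal RAM), then wake up before serving again
curl -X POST 'http://localhost:8000/sleep?level=2'
curl -X POST 'http://localhost:8000/wake_up'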
"โจ At vLLM we strive for correctness reliability and open collaboration every detail matters. Together with @Kimi_Moonshot we verified Kimi K2s tool-calling accuracy on vLLM using the latest K2-Vendor-Verifier benchmark. Our debugging uncovered [--] key compatibility issues all now fixed โ
: [--] Missing add_generation_prompt [--] Empty content handling [--] Strict tool-call ID parsing ๐ After fixes: Kimi K2 models now reach 99.9% request success rate and 76% schema accuracy on vLLM a [---] improvement We also found that some schema errors came from tools called outside the current context a known"
X Link 2025-10-28T10:15Z 31.3K followers, 17K engagements
"Wow excited to see PewDiePie using vLLM to serve language models locally ๐ vLLM brings easy fast and cheap LLM serving for everyone ๐ฅฐ PewDiePie in 2025: built a [------] rig runs Llama 70B gpt-oss-120B & Qwen 245B locally via vLLM built a custom web UI (chat RAG search TTS) ran protein-folding simulations for charity created an AI council a swarm of [--] models now fine-tuning his own model https://t.co/7otKqRH0D5 PewDiePie in 2025: built a [------] rig runs Llama 70B gpt-oss-120B & Qwen 245B locally via vLLM built a custom web UI (chat RAG search TTS) ran protein-folding simulations for charity"
X Link 2025-11-03T07:02Z 31.2K followers, 166K engagements
"Wow Quantization-enhanced Reinforcement Learning using vLLM Great job by @yukangchen_ ๐ We open-sourced QeRL Quantization-enhanced Reinforcement Learning ๐ง 4-bit quantized RL training ๐ช Train a 32B LLM on a single H100 GPU โ [---] faster overall training ๐ฏ Accuracy on par with bfloat16-level accuracy ๐ฅ Supports NVFP4 quantization format Moreover we show https://t.co/56GP7H8SEC We open-sourced QeRL Quantization-enhanced Reinforcement Learning ๐ง 4-bit quantized RL training ๐ช Train a 32B LLM on a single H100 GPU โ [---] faster overall training ๐ฏ Accuracy on par with bfloat16-level accuracy"
X Link 2025-11-04T02:40Z 31.2K followers, 23.5K engagements
"Amazing work by @RidgerZhu and the ByteDance Seed team Scaling Latent Reasoning via Looped LMs introduces looped reasoning as a new scaling dimension. ๐ฅ The Ouro model is now runnable on vLLM (nightly version) bringing efficient inference to this new paradigm of latent reasoning. Thrilled to release new paper: Scaling Latent Reasoning via Looped Language Models. TLDR: We scale up loop language models to [---] billion parameters and pretrained on [--] trillion tokens. The resulting model is on par with SOTA language models of [--] to 3x size. https://t.co/6iauhVZ83g Thrilled to release new paper:"
X Link 2025-11-04T13:06Z 31.2K followers, 32.3K engagements
"๐ Kimi-K2 Reasoning is coming very soon just got merged into VLLM LETS FUCKING GOOOO im so hyped im so hyped im so hyped https://t.co/tmZHPpCw3H https://t.co/BESzbBKtl3 Kimi-K2 Reasoning is coming very soon just got merged into VLLM LETS FUCKING GOOOO im so hyped im so hyped im so hyped https://t.co/tmZHPpCw3H https://t.co/BESzbBKtl3"
X Link 2025-11-05T14:11Z 31.3K followers, 27.7K engagements
"๐ Day [--] support: Kimi K2 Thinking now running on vLLM In partnership with @Kimi_Moonshot we're proud to deliver official support for the state-of-the-art open thinking model with 1T params 32B active. Easy deploy in vLLM (nightly version) with OpenAI-compatible API: What makes it special: โก Native INT4 quantization [--] faster inference ๐พ Half the memory footprint no accuracy loss ๐ฏ 256K context stable across 200-300 tool calls ๐ฏ Official recipe & deployment guide included World-class reasoning now accessible to everyone. ๐ฆ Model: ๐ Recipes: #vLLM #KimiK2 #LLMInference"
X Link 2025-11-06T15:29Z 31.2K followers, 34.4K engagements
"vLLM now optimized for @Intel Arc Pro B-Series GPUs ๐ Delivering accessible high-performance LLM serving with exceptional price-to-performance. Key highlights: 80%+ MoE hardware efficiency with persistent zero-gap kernels 24GB VRAM [---] GB/s bandwidth [---] AI engines Full TP/PP/EP support for multi-GPU scaling DeepSeek distilled models (8B-70B) GPT-OSS Qwen validated MLPerf v5.1: Arc Pro B60 shows strong performance-per-dollar on Llama 8B ๐ Advanced optimizations: Single persistent kernel loop (eliminates launch overhead) Dynamic load balancing with atomic scheduling Hardware-friendly"
X Link 2025-11-11T15:29Z 31.2K followers, 16.4K engagements
"๐ No More TrainInference Mismatch We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) the first open-source run where training and inference numerics match exactly. It only takes [--] steps: [--] Make vLLM batch-invariant (same seq same output regardless of batching) [--] Ensure forward passes in training use identical kernels as inference [--] Add custom backward passes in PyTorch โ
Verified on Qwen3 1.7B + GSM8K: batch_inv_ON (bitwise exact) KL=0.0 faster convergence higher reward batch_inv_OFF reduced reward instability We audited every op imported vLLMs fused"
X Link 2025-11-12T16:47Z 31.2K followers, 91.4K engagements
"uv helps vLLM out of dependency hell The simple vLLM nightly index installation has been broken by xformers for a while and people have to use a fairly complicated command to work around. With the new --exclude feature from uv 0.9.8 we can restore the simplicity of installing vLLM uv self update echo xformers exclude.txt uv pip install -U vllm --extra-index-url --excludes exclude.txt @charliermarsh is going to further simplify the command to directly take package names from --excludes. Interested Please help voice out https://wheels.vllm.ai/nightly https://wheels.vllm.ai/nightly"
X Link 2025-11-13T04:51Z 31.2K followers, 57.5K engagements
"Check out a great framework built upon vLLM to serve Any-to-Any multimodal models Cornserve ๐ฝ is a distributed serving platform for complex Any-to-Any multimodal models https://t.co/Kr5lrq828i Cornserve ๐ฝ is a distributed serving platform for complex Any-to-Any multimodal models https://t.co/Kr5lrq828i"
X Link 2025-11-17T05:33Z 31.4K followers, 22.4K engagements
"When we added support for gpt-oss the Responses API didn't have a standard and we essentially reverse-engineered the protocol by iterating and guessing based on the behavior. We are very excited about the Open Responses spec: clean primitives better tooling consistency for the win Today were announcing Open Responses: an open-source spec for building multi-provider interoperable LLM interfaces built on top of the original OpenAI Responses API. โ
Multi-provider by default โ
Useful for real-world workflows โ
Extensible without fragmentation Build https://t.co/SJiBFx1BOF Today were announcing"
X Link 2026-01-16T04:14Z 31K followers, 60.4K engagements
"Day-0 support for GLM-4.7-Flash is now available in vLLM ๐ A new standard for the 30B classefficient lightweight and powerful for coding & agents. Also great for creative writing translation and long-context tasks. PR: https://github.com/vllm-project/vllm/pull/31386 Introducing GLM-4.7-Flash: Your local coding and agentic assistant. Setting a new standard for the 30B class GLM-4.7-Flash balances high performance with efficiency making it the perfect lightweight deployment option. Beyond coding it is also recommended for creative writing https://t.co/gd7hWQathC"
X Link 2026-01-20T01:21Z 31K followers, 32.5K engagements
"Quick vLLM Tip ๐ Batch Invariance Deterministic offline inference Same prompt + different batch size can produce different outputs. Fix: VLLM_BATCH_INVARIANT=1 โ
Docs: https://docs.vllm.ai/en/latest/features/batch_invariance/#offline-inference https://docs.vllm.ai/en/latest/features/batch_invariance/#offline-inference"
X Link 2026-01-20T15:00Z 30.5K followers, 11.3K engagements
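A minimal sketch of applying the flag from the tip above to an offline run; the env var is the one quoted in the tip, while the script name is hypothetical:
# run the same offline-inference script at different batch sizes;
# with the flag set, outputs should match regardless of batching
VLLM_BATCH_INVARIANT=1 python offline_inference.py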
"Quick update: from v0.14.0 onward CI ships ROCm Python wheels + Docker images by default. That means you can pip install or pull prebuilt images without extra build steps. Nightly builds are still on the way. Try it here: Hope this helps #vLLM #ROCm #AMD http://vllm.ai/#quick-start http://vllm.ai/#quick-start http://vllm.ai/#quick-start http://vllm.ai/#quick-start"
X Link 2026-01-21T10:56Z 30.5K followers, 17.7K engagements
"๐ Heading to AAAI [----] in Singapore Come meet the vLLM-Omni team at Expo Hall [--] Booth A50 ๐ Saturday Jan [--] 11:30 AM 12:30 PM We'll be sharing how vLLM-Omni unifies LLM vision and diffusion workloads into a single inference stackand what's coming next. See you there #AAAI2026 #vLLM #vLLMOmni #OpenSource https://twitter.com/i/web/status/2014279556140863793 https://twitter.com/i/web/status/2014279556140863793"
X Link 2026-01-22T10:10Z 30.5K followers, [----] engagements
"Congrats to the @Alibaba_Qwen team on Qwen3-TTS ๐ vLLM-Omni is ready with day-0 support voice cloning voice design and natural language control for emotion & prosody all running natively. Offline inference available now via PR #895 online serving coming soon. ๐ https://github.com/vllm-project/vllm-omni/pull/895 Qwen3-TTS is officially live. Weve open-sourced the full familyVoiceDesign CustomVoice and Basebringing high quality to the open community. - [--] models (0.6B & 1.8B) - Free-form voice design & cloning - Support for [--] languages - SOTA 12Hz tokenizer for high compression -"
X Link 2026-01-22T14:45Z 31K followers, 42.3K engagements
"Quick vLLM Tip ๐ Auto Max Model Length Running 10M context models like Llama [--] Scout OOM on startup because default context exceeds GPU memory --max-model-len auto (or -1) vLLM auto-fits the max context length to your GPU. Works with hybrid models TP/PP configs. No more manual tuning"
X Link 2026-01-26T15:00Z 30.5K followers, 11.3K engagements
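A minimal sketch of the flag from the tip above; the Llama 4 Scout checkpoint name is an assumption for illustration, not taken from the tip:
# let vLLM fit the maximum context length to available GPU memory at startup
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --max-model-len auto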
"๐ Congrats @Kimi_Moonshot on Kimi K2.5 a native multimodal agentic model built on 15T vision-language tokens. vLLM day-0 support is ready. If you're building multimodal agents this one's worth evaluating: Vision-native: Pre-trained on 15T vision-language tokens not a bolted-on encoder Code from vision: Feed it UI designs or video workflows get working code Agent Swarm: Breaks complex tasks into parallel sub-agents automatically PR: Serving command: https://github.com/vllm-project/vllm/pull/33131 ๐ฅ Meet Kimi K2.5 Open-Source Visual Agentic Intelligence. ๐น Global SOTA on Agentic Benchmarks:"
X Link 2026-01-27T05:47Z 31K followers, 15.3K engagements
"๐ DeepSeek-OCR [--] introducing Visual Causal Flow from @deepseek_ai learning to read documents the way humans do now running on vLLM โก with vllm==0.8.5 day-0 support. ๐ง Replaces fixed raster scanning with learned causal token reordering via DeepEncoder V2. ๐ [--] visual token compression only [-------] tokens per image. ๐ 91.09% on OmniDocBench v1.5 (+3.73%) reading order errors cut by 33% repetition down 3040% in production. Model: Github: https://github.com/deepseek-ai/DeepSeek-OCR-2/ https://huggingface.co/deepseek-ai/DeepSeek-OCR-2 https://github.com/deepseek-ai/DeepSeek-OCR-2/"
X Link 2026-01-27T08:27Z 30.7K followers, 14.7K engagements
"๐ Congrats to @arcee_ai on releasing Trinity Large with day-0 support in vLLM A 398B sparse MoE with 13B active params trained on 17T+ tokens delivering frontier-level quality with efficient compute. You can serve it now with vLLM ๐ Thanks to the Arcee AI and vLLM community for the collaboration Today were releasing the first weights from Trinity Large our first frontier-scale model in the Trinity MoE family. https://t.co/2zEm4WWMLZ Today were releasing the first weights from Trinity Large our first frontier-scale model in the Trinity MoE family. https://t.co/2zEm4WWMLZ"
X Link 2026-01-28T01:28Z 30.5K followers, 12.4K engagements
"Welcome everyone to the Office Hours ๐ Quick reminder: the Events page on our website has been synced with many SIG meetings including Multimodal Omni AMD torch-compile and CI. Choose the SIG you care about most and get involved. You can also join the discussion in the corresponding #sig-xxx channels on Were excited to have you join the conversation ๐ http://slack.vllm.ai http://vllm.ai/events http://vllm.ai/events vLLM Office Hours this week. Well kick off with our bi-weekly @vllm_project update from core committer @mgoin_ then dive into the vLLM CPU Offloading Connector with the"
X Link 2026-01-28T15:00Z 30.6K followers, [----] engagements
"Nemotron [--] Nano in NVFP4 just dropped from @NVIDIA 4x throughput on B200 (vs FP8-H100) with accuracy preserved via Quantization-Aware Distillation. The checkpoint is already supported by vLLM ๐คThanks NVIDIA vLLM community https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4"
X Link 2026-01-28T17:21Z 31.1K followers, [----] engagements
"Missed Dynamo Day [----] Our session on large-scale LLM serving with vLLM from @simon_mo_ is now available on NVIDIA On-Demand. Covers disaggregated inference Wide-EP for MoE and rack-scale deployments on GB200 NVL72. Thanks @nvidia for hosting Watch recording: #vLLM #NVIDIA https://www.nvidia.com/en-us/on-demand/session/other25-dynamoday06/playlistId=playList-e42aee58-4db9-4ce4-8a6f-c41d8e272d72 https://www.nvidia.com/en-us/on-demand/session/other25-dynamoday06/playlistId=playList-e42aee58-4db9-4ce4-8a6f-c41d8e272d72"
X Link 2026-01-30T03:19Z 30.7K followers, [----] engagements
"๐ vLLM v0.15.0 is here [---] commits from [---] contributors (39 new) Highlights: โก Async scheduling + Pipeline Parallelism ๐ง Mamba prefix caching (2x speedup) โซ Blackwell FP4 65% faster ๐ฅ AMD RDNA3/RDNA4 consumer GPU support ๐ค Kimi-K2.5 Molmo2 Eagle2.5-8B VLM EAGLE3 speculative decoding More updates: ๐ https://github.com/vllm-project/vllm/releases/tag/v0.15.0 https://github.com/vllm-project/vllm/releases/tag/v0.15.0"
X Link 2026-01-31T02:14Z 31.1K followers, 38.7K engagements
"๐๐๐ Congrats to @StepFun_ai on releasing Step [---] Flash and day-0 support is ready in vLLM A 196B MoE that activates only 11B params per token giving you frontier reasoning with exceptional efficiency. Highlights: 74.4% SWE-bench Verified 51.0% Terminal-Bench [---] 256K context with 3:1 Sliding Window Attention for cost-efficient long context Built for coding agents and long-horizon agentic tasks Check out our detailed deployment recipe below ๐ https://docs.vllm.ai/projects/recipes/en/latest/StepFun/Step-3.5-Flash.html โก Step [---] Flash is coming: Fast Enough to Think. Reliable Enough to Act"
X Link 2026-02-02T17:22Z 31K followers, 16.5K engagements
"๐ฅMultimodal models are rapidly gaining traction in production AI and vLLM is often the go-to inference engine for running them at scale. Check out this new @MLCommons @MLPerf benchmark with Qwen3-VL + vLLM on @Shopify product catalog data๐ ๐ NEW: @MLPerf Inference v6.0 debuts Qwen3-VL + @Shopify Product Catalog benchmark 40M products daily. Real production data. First Qwen model in MLPerf. Submit by Feb [--] [----] https://t.co/tWhbcaxdVo #MLPerf #VLM #Shopify #MLCommons https://t.co/001OEnpyLV ๐ NEW: @MLPerf Inference v6.0 debuts Qwen3-VL + @Shopify Product Catalog benchmark 40M products"
X Link 2026-02-02T22:35Z 31K followers, [----] engagements