@lmarena_ai lmarena.ai (formerly lmsys.org)

lmarena.ai (formerly lmsys.org) posts on X about ai, leaderboard, arena, open ai the most. They currently have [-------] followers and [---] posts still getting attention that total [---------] engagements in the last [--] hours.

Engagements: [---------] #

Mentions: [--] #

Followers: [-------] #

CreatorRank: [-------] #

Social Influence

Social category influence technology brands social networks stocks travel destinations finance vc firms celebrities automotive brands cryptocurrencies musicians

Social topic influence ai, leaderboard, arena #157, open ai #2804, 6969 #132, agentic #372, categories, in the, math, top 10

Top assets mentioned Alphabet Inc Class A (GOOGL) Microsoft Corp. (MSFT)

Top Social Posts

Top posts by engagements in the last [--] hours

"The open-source clone of GPT-4 is almost here LLaVA combines a vision encoder and Vicuna enabling it to see recognize talk about and reason about images in a different way. 🔥Visual Instruction Tuning with GPT-4 We release LLaVA a Language-and-Vision Assistant that exhibits some near multimodal GPT-4 level capabilities: - 🤖Visual Chat: 85% relative score of GPT-4 - 🧪Science QA on reasoning: New SoTA 92.53% beats multimodal chain-of-thoughts https://t.co/ORsiPoRP5I 🔥Visual Instruction Tuning with GPT-4 We release LLaVA a Language-and-Vision Assistant that exhibits some near multimodal GPT-4"
X Link 2023-04-18T22:38Z 57K followers, 81.8K engagements

"Introducing FastChat's OpenAI-compatible API server Many applications have been built on OpenAI APIs but now you can effortlessly port them to use open-source alternatives without modifying the code. Blog (with LangChain + Vicuna as an example):"
X Link 2023-06-09T15:45Z 24.2K followers, [---] engagements

"🔥Introducing LongChat🤖 our new chatbots supporting 16K tokens context and LongEval our new benchmark for testing long context chatbots. 🤥Surprisingly we found open LLMs often fail to achieve their promised context length. Check our blog for details:"
X Link 2023-06-29T23:38Z 24.2K followers, 101.1K engagements

"LongChat applies RoPE condensing and is finetuned using curated long conversation data. It handles a context of up to 16K tokens while still precisely following human instructions in dialogues showing strong human preferences on our MT-Bench"
X Link 2023-06-29T23:38Z 24.3K followers, [----] engagements

"LongEval is designed to test the long-range ability of chatbots e.g. retrieval and association in long sequences. Using LongEval we find that many open LLMs do not perform well even on a much shorter context than claimed"
X Link 2023-06-29T23:38Z [--] followers, [----] engagements

"On the opposite side we find proprietary models like GPT-3.5-turbo-16K and Anthropic-Claude-100K achieve near-perfect performance. We also find LongChat achieves the best in open-source -- shows promising results in closing the gap"
X Link 2023-06-29T23:38Z 24.3K followers, [----] engagements

"LongChat is currently available in two sizes: 7B and 13B. The preview versions are available at and"
X Link 2023-06-30T00:52Z 24.1K followers, [----] engagements

"How good is Llama [--] Chat Key insights from our eval: [--]. Llama-2 exhibits stronger instruction-following skills yet still significantly lags behind GPT-3.5/Claude in extraction/coding/math [--]. Overly sensitive to safety could cause misinterpretation on user queries [--]. Comparable"
X Link 2023-07-19T19:47Z 24.2K followers, 115.7K engagements

"Excited to release our latest Vicuna v1.5 series featuring 4K and 16K context lengths with improved performance on almost all benchmarks Vicuna v1.5 is based on the commercial-friendly Llama [--] and has extended context length via positional interpolation. Since its release"
X Link 2023-08-02T17:42Z 19.4K followers, 103.2K engagements

"Congrats The WizardLM team is continuously pushing the limits of open-source models. We'll also update the Elo ratings for WizardLM-70B as soon as we receive enough votes. You can compare WizardLM-70B Llama-70B Vicuna-33B all at"
X Link 2023-09-07T19:45Z 24.2K followers, 19.3K engagements

"We are preparing to release an LLM conversation dataset from (April-August). If you've used our website and want your conversation excluded please complete this form with the necessary details. The dataset will contain roughly [--] million real-world LLM conversations. Please also let us know what you want to see in this dataset release"
X Link 2023-09-11T17:11Z 24.2K followers, 61.7K engagements

"Mistral-7B is now available at under both the "Chatbot Arena" and "Single Model" tab. Test it yourself We are glad that our tools (FastChat/Skypilot/vLLM) helped the release of this model Chatbot Arena now serves over [---] billion parameters for open-source models with our scalable infrastructure and help from sponsors. If you want to see your model in the arena you can contact us and provide us API endpoints. https://chat.lmsys.org/ https://chat.lmsys.org/ https://chat.lmsys.org/ https://chat.lmsys.org/"
X Link 2023-09-27T18:45Z 57.1K followers, 43.4K engagements

"Alibaba's powerful LLM Qwen-14b-chat is now live on Arena Trained with 3T tokens it achieves SoTA across all benchmarks. Curious how it performs Challenge it with your toughest prompts Stay tuned until we find out its Elo. https://huggingface.co/Qwen/Qwen-14B-Chat http://chat.lmsys.org https://huggingface.co/Qwen/Qwen-14B-Chat http://chat.lmsys.org"
X Link 2023-10-12T17:56Z 75.1K followers, 35.6K engagements

"We believe in the Open Movement for AI. Since Llama's release we have been actively taking actions to open our models (Vicunas) code (FastChat) evals and launch Chatbot Arena a crowd-sourced open platform to gather human/LLM interactions with feedback. Our recent release LMSys-Chat-1M is a collection of one-million real-world human/LLM conversations collected from along with human feedback in Chatbot Arena convos. We believe our datasets can serve as a valuable resource to understand and advance LLMs in real-world settings. Moving forward we remain committed to contributing to the community"
X Link 2023-10-16T15:25Z 13.7K followers, 76.8K engagements

"We've seen impressive Elo score for Zephyr-7b-alpha from our internal leaderboard. Now the upgraded Zephyr-7b-beta is available at Arena Challenge it against 10+ LLMs on Arena 🤖⚔"
X Link 2023-10-27T07:44Z 13.9K followers, 17.4K engagements

"Wow Zephyr-7b-beta's Elo is soaring high now🔥 surpassing strong 13B models on our internal Arena leaderboard (based on [---] votes + calibration with 90K votes). Let's find out the best 7B model Evaluate Zephyr against 10+ top-tier LLMs + vote at"
X Link 2023-10-27T08:37Z 14.1K followers, 69.8K engagements

"Super awesome to see LLM-judge has been making real impact -- now integrated in MLflow Check out @databricks in-depth blog post about RAG eval and also our earlier study published in NeurIPS'23 https://arxiv.org/abs/2306.05685 MLflow [---] is out today with new support for LLM-based eval metrics among other features. Read about how we've been using it to improve our RAG apps at Databricks like our docs assistant: https://t.co/o6C1b83ML2 https://arxiv.org/abs/2306.05685 MLflow [---] is out today with new support for LLM-based eval metrics among other features. Read about how we've been using it to"
X Link 2023-11-01T00:42Z 46.5K followers, 11.7K engagements

"@databricks Here's our previous thread for paper summary"
X Link 2023-11-01T00:47Z 14.1K followers, [----] engagements

"We're super excited to partner with @kaggle welcoming the ML and data science community to Arena Yesterday's Kaggle launch we recorded the highest traffic to date since the Arena launch Over 4K votes in a day🗳 Our mission remains building an open and community-first platform for LLM + human feedback. Your participation is the key to AI openness. Join us"
X Link 2023-11-01T18:00Z 14.2K followers, 60.4K engagements

"Congrats OpenChat-3.5 for matching GPT-3.5 performance across major benchmarks It's now live at Come challenge it with your toughest prompts and see how it really compares with ChatGPT"
X Link 2023-11-06T15:00Z 14.2K followers, 48.5K engagements

"Update: we've brought Claude back online to Arena Now you can compare GPT-4-Turbo and Claude-2 side-by-side at Remember to cast your votes in anonymous mode to make them count on our leaderboard. Your feedback shapes the future of open data in AI. We're committed to our community and the promise of sharing human feedback data. Stay tuned for our next release"
X Link 2023-11-13T04:47Z 15.3K followers, 14.9K engagements

"Catch me if you can Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSK-8K/HumanEval) To validate results we followed OpenAI's decontamination method and found no evidence of contamination.🤔 Blog: 1/n"
X Link 2023-11-14T16:24Z 17.2K followers, 257.2K engagements

"Announcing S-LoRA: Serving Thousands of LoRA Adapters with a Single A100 S-LoRA boosts throughput by up to 4x and significantly increase concurrent adapter support by orders of magnitude when comparing against baselines (PEFT/vLLM-packed). Use cases S-LoRA enables emerging trends in customizable AI. Service providers now have the capability to deliver personalized adapters at scale leveraging shared base weights for cost efficiency. This breakthrough makes managing millions of adapters not just possible but practical. Key innovations behind S-LoRA include: [--]. Unified paging to reduce"
X Link 2023-11-15T18:48Z 16.8K followers, 98.4K engagements

"The amazing team behind this effort @ying11231 @shiyi_c98 @DachengLi177 Coleman Hooper Nicholas Lee Shuo Yang Christopher Chou @BanghuaZ @lm_zheng Kurt Keutzer @profjoeyg Ion Stoica"
X Link 2023-11-15T19:22Z 15.7K followers, [----] engagements

"News: GPT-4-Turbo has surged to the top of Arena leaderboard now the best model as rated by 2500+ Arena user votes Its rapid climb to the top overtaking GPT-4 in strength speed and context length is truly impressive. Congrats to @OpenAI for this incredible achievement"
X Link 2023-11-17T16:51Z 18.2K followers, 1.4M engagements

"4/8 Each decoding step has [--] branches: the lookahead branch maintains a fixed-sized 2D window to gen n-grams from the Jacobi trajectory; the verification branch verifies promising n-grams. We impl the [--] in one atten mask to further utilize GPU's parallel computing power"
X Link 2023-11-21T20:10Z 16.8K followers, [----] engagements

"5/8 Lookahead decoding generates and verifies many N-grams at once w/o external sources. This increases the cost of a step but also enhances the likelihood of accepting a longer n-gram. In other words lookahead decoding allows to trade more flops for reducing latency"
X Link 2023-11-21T20:10Z 16.9K followers, [----] engagements

"6/8 We examine this scaling behavior between flops vs. latency reduction and find a scaling law: when the n-gram size is sufficiently large (e.g. 11-gram) exponentially increasing the future token guesses (i.e. window size) can linearly reduce the number of decoding steps"
X Link 2023-11-21T20:10Z 24.3K followers, [----] engagements

"7/8 Our study shows lookahead decoding substantially reduce latency ranging from 1.5x to 2.3x on different datasets on a single GPU. See figure below. For very latency-sensitive apps lookahead decoding can use FLOPs from [--] GPUs to achieve even greater latency reduction"
X Link 2023-11-21T20:10Z 16.8K followers, [----] engagements

"8/8 We've developed an implementation compatible with huggingface/transformers. Users can enhance the performance of HF's native generate function with just a few LoC. Code: We are working on an optimized CUDA kernel stay tuned"
X Link 2023-11-21T20:10Z 24.3K followers, [----] engagements

"Congrats @AnthropicAI on the new Claude-2.1 release It is now online on Arena for side-by-side comparisons and blind test mode. Next leaderboard release we'll separate votes for different versions of GPT-4 Claude to better track their performance over time. Stay tuned"
X Link 2023-11-21T23:10Z 16.8K followers, 12.7K engagements

"Yi-34B-Chat is now on Arena⚔ Congrats @01AI_Yi on the new model release🔥 Yi-34B is top-1 on @huggingface leaderboard across major benchmarks. It's time to challenge it with your tough tasks at We'll soon find out its Elo ranking with real user votes🗳"
X Link 2023-11-24T07:16Z 17.1K followers, 20.5K engagements

"Yes @Teknium1's OpenHermes [---] is live on Arena alongside 20+ strong models Also happy to share that 200000+ votes has been cast by Arena users covering a diverse set of real-world queries. More analysis coming soon. Join us in determining top models your vote matters"
X Link 2023-11-29T06:08Z 17.2K followers, 25K engagements

"@gokuldas17708 @Teknium1 We just moved from OpenAI to Azure which has built-in moderation and may generate some false positive. We're working on how to turn it off"
X Link 2023-11-29T18:44Z 17.1K followers, [--] engagements

"Big congrats to @perplexity_ai on their exciting launch of online LLM APIs The Live LLM 🛜 pplx-70b-online is now available on Arena and beating gpt-3.5 in real-time queries about @Tesla CyberTruck Curious Come chat & vote:"
X Link 2023-12-01T22:07Z 17.4K followers, 71.4K engagements

"Arena live update: 1000+ new votes have just rolled in for Mixtral-8x7b Excitingly Mixtral-8x7b is overtaking Tulu-2-70B as the top open model and achieving 50% winrate against gpt-3.5-turbo. Let's cast more votes and challenge it with the toughest prompt at Stay tuned for more update"
X Link 2023-12-12T21:04Z 19.4K followers, 626.9K engagements

"♊Gemini is now in the Arena. Excited to see its ranking with human evals Meanwhile our server just hit the highest traffic since May with [-----] votes in just [--] days. How incredible Huge thanks to @karpathy and the amazing community😂Let's vote at"
X Link 2023-12-14T13:50Z 19.4K followers, 267.7K engagements

"We're honored to be sponsored by @a16z big thanks We remain committed to the community by developing open models data and evals platform. Check out our efforts: - Vicuna - Arena evals - 1M-Chat dataset - FastChat - MT-Bench More coming soon"
X Link 2023-12-14T23:30Z 19.4K followers, 21.3K engagements

"Arena Update We've collected over [----] and [----] votes for Mixtral-8x7B and Gemini Pro. Both show strong performance against GPT-3.5-Turbo. Big congrats again on the release @MistralAI @GoogleDeepMind Full leaderboard:"
X Link 2023-12-15T18:31Z 24.1K followers, 582.5K engagements

"Introducing VTC: the first fair scheduler for LLM serving. Are you troubled by OpenAI's rate limit Although it's a necessary mechanism to prevent any single client from monopolizing the request queue we demonstrate that it doesn't guarantee either fairness among clients or high resource utilization. Fairness directly transfers to good user experience in contemporary LLM services. These services foundational to complex application structures are akin to system services in operating systems. They not only require but also benefit significantly from a fair scheduler. In this paper we - Formally"
X Link 2024-01-05T19:29Z 24.3K followers, 61.2K engagements

"Arena We've paused GPT-4-Turbo in Arena's direct chat mode due to budget constraints but we're eagerly seeking support to reactivate it for free. Access to top-tier models like GPT-4-Turbo not only enriches our insights but also motivates our users to engage and contribute valuable data. Please let us know if you'd like to support us and our community :) We're also preparing for the next dataset release including user prompt and feedback data. Stay tuned for more details"
X Link 2024-01-06T17:40Z 24.2K followers, 60.2K engagements

"Just received a big shoutout to @OpenAI's incredible speed gpt-4-turbo will be up again soon. @lmsysorg Credits granted @lmsysorg Credits granted"
X Link 2024-01-07T02:33Z 24.2K followers, 26.3K engagements

"Arena Exciting update Mistral Medium has gathered 6000+ votes and is showing remarkable performance reaching the level of Claude. Congrats @MistralAI We have also revamped our leaderboard with more Arena stats (votes CI). Let us know any thoughts :) Leaderboard"
X Link 2024-01-10T12:33Z [--] followers, 500.8K engagements

"We are thrilled to introduce SGLang our next-generation interface and runtime for LLM inference It greatly improves the execution and programming efficiency of complex LLM programs by co-designing the front-end language and back-end runtime. On the backend we propose RadixAttention a novel technique that automatically handles various patterns of KV cache reuse. On the frontend we designed a flexible prompting language for you to control the generation process. SGLang can perform up to 5x faster than existing systems like Guidance and vLLM on common LLM workloads (agent reasoning chat RAG"
X Link 2024-01-17T17:41Z 54.3K followers, 247.2K engagements

"Some people are confused about the differences between SGLang Runtime and vLLM. To clarify these are two distinct yet closely collaborating projects. SGLang Runtime has imported some model and layer implementations from vLLM but has redesigned the batching and caching scheduler to incorporate new features such as RadixAttention. Currently the vLLM team is actively working to port some optimizations from SGLang Runtime into vLLM. As of now the SGLang frontend supports multiple backends including OpenAI Anthropic Gemini and SGLang Runtime (backend) itself. Plans are also in place to support"
X Link 2024-01-19T18:43Z 24.2K followers, [----] engagements

"Besides Bard two exciting new models are also added - StripedHyena-Nous-7B (Beyond Transformer architecture by @togethercompute @NousResearch) - DeepSeek-67B-Chat @deepseek_ai Come challenge them and vote at Links: - -"
X Link 2024-01-22T17:42Z 24.1K followers, [----] engagements

"Arena Excited to announce a new sponsorship by @togethercompute A huge thanks to Together AI for their generous support bringing models like Qwen1.5 @NousResearch models and more into our community. Now five new players have joined in the Arena. - Qwen1.5-72b-chat - Qwen1.5-7b-chat - Qwen1.5-4b-chat - OpenChat-3.5-0106 - Nous-Hermes-2-Mixtral-8x7b-DPO Big congrats to Qwen team @openchatdev @NousResearch on new model launch. Come challenge the 30+ Arena players and let the community decide who wins out We've also upgraded our Gradio server to the latest version with smoother experience."
X Link 2024-02-07T22:43Z 26.8K followers, 45.9K engagements

"@togethercompute @NousResearch We thank all our sponsors @kaggle @mbzuai @a16z @anyscalecompute and @huggingface. Our platform is not possible without support for you all"
X Link 2024-02-07T23:37Z 25K followers, [----] engagements

"Arena Update @alibaba_cloud's Qwen1.5-72B has soared to the top-10 of the leaderboard with 5000+ votes becoming the best open model and matching performance on par with Mistral-medium Huge congrats to the Qwen team for their incredible work and contribution to the open-source community More highlights: - Arena votes 300K - GPT-3.5-Turbo-0125 shows improvement over its predecessor (1106) with 50% cost reduction - OpenChat-3.5-0106 now ranks among the top 7B models - Score updates for Qwen1.5-7B/4B and Nous-Hermes-2-Mixtral-8x7B-DPO - Mistral-7B-instruct-v0.2 in the Arena We'll also soon"
X Link 2024-02-15T20:20Z 71.2K followers, 85.6K engagements

"🔥Big congrats to @AnthropicAI on the impressive Claude-3 release Claude-3 Opus and Sonnet are now in the Arena. Come chat & vote at 🗳 http://chat.lmsys.org Today we're announcing Claude [--] our next generation of AI models. The three state-of-the-art modelsClaude [--] Opus Claude [--] Sonnet and Claude [--] Haikuset new industry benchmarks across reasoning math coding multilingual understanding and vision. https://t.co/TqDuqNWDoM http://chat.lmsys.org Today we're announcing Claude [--] our next generation of AI models. The three state-of-the-art modelsClaude [--] Opus Claude [--] Sonnet and Claude [--] Haikuset"
X Link 2024-03-04T15:12Z 60.8K followers, 168.9K engagements

"🔥Exciting news from Arena @Anthropic's Claude-3 Ranking is here📈 Claude-3 has ignited immense community interest propelling Arena to unprecedented traffic with over [-----] votes in just three days We're amazed by Claude-3's extraordinary performance. Opus is making history as the first model to rival GPT-4-Turbo while Sonnet stands out with its speed and performance closely matching GPT-4. Huge congrats to @Anthropic for this remarkable launch [----] looks incredibly exciting. Can't wait to see what comes next. 🔥Big congrats to @AnthropicAI on the impressive Claude-3 release Claude-3 Opus and"
X Link 2024-03-07T16:19Z 28.7K followers, 165.3K engagements

"One of the best 7B models Starling-LM-7B has now upgraded to Beta🔥It shows promising potential in our coming next generation benchmark. Now in Arena accepting votes 🚀 Presenting Starling-LM-7B-beta our cutting-edge 7B language model fine-tuned with RLHF 🌟 Also introducing Starling-RM-34B a Yi-34B-based reward model trained on our Nectar dataset surpassing our previous 7B RM in all benchmarks. ✨ We've fine-tuned the latest Openchat 🚀 Presenting Starling-LM-7B-beta our cutting-edge 7B language model fine-tuned with RLHF 🌟 Also introducing Starling-RM-34B a Yi-34B-based reward model trained"
X Link 2024-03-22T19:06Z 29.9K followers, 16K engagements

"Congrats @databricks for the exciting DBRX launch🔥 dbrx-instruct is now in Arena accepting votes Meet DBRX a new sota open llm from @databricks. It's a 132B MoE with 36B active params trained from scratch on 12T tokens. It sets a new bar on all the standard benchmarks and - as an MoE - inference is blazingly fast. Simply put it's the model your data has been waiting for. https://t.co/JdGx12eeu8 Meet DBRX a new sota open llm from @databricks. It's a 132B MoE with 36B active params trained from scratch on 12T tokens. It sets a new bar on all the standard benchmarks and - as an MoE - inference"
X Link 2024-03-27T21:55Z 46.5K followers, 35.1K engagements

"Congrats @Alibaba_Qwen on the new Qwen1.5-32B release🔥 We have shipped it to Arena thanks to the fast & reliable endpoint support by @togethercompute Qwen1.5 72B has been the best open model on Chatbot Arena leaderboard. Very excited to see how the 32B performs 🤔 Have you met this problem before 14B is not capable enough but 72B is too large 🎾 A 30B model might be the sweet spot and now we finally have it the new member of Qwen1.5 series Qwen1.5-32B Blog: https://t.co/b5VeO7H6Ep HF: https://t.co/VmLL1lGOa0 search repos https://t.co/ZawClzFzS3 🤔 Have you met this problem before 14B is not"
X Link 2024-04-05T16:04Z 32.2K followers, 26.8K engagements

"Exciting news - the latest Arena result are out @cohere's Command R+ has climbed to the 6th spot matching GPT-4-0314 level by 13K+ human votes It's undoubtedly the best open model on the leaderboard now🔥 Big congrats to @cohere's incredible work & valuable contribution to the open community More exciting updates: - Qwen1.5-32B-Chat almost top-10 - Gemma-1.1-7B-it shows great improvement (1044 - [----] on par with Llama-2-70b) - Starling-7B-Beta still the best 7B with over 13K votes"
X Link 2024-04-09T09:30Z 35.3K followers, 751.4K engagements

"Congrats @openai on the new GPT-4 Turbo launch🔥 The model is now in the Arena Come challenge it with your toughest prompts🧩 Majorly improved GPT-4 Turbo model available now in the API and rolling out in ChatGPT. Majorly improved GPT-4 Turbo model available now in the API and rolling out in ChatGPT"
X Link 2024-04-09T19:18Z 59.8K followers, 67.2K engagements

"🔥Exciting news -- GPT-4-Turbo has just reclaimed the No. [--] spot on the Arena leaderboard again Woah We collect over 8K user votes from diverse domains and observe its strong coding & reasoning capability over others. Hats off to @OpenAI for this incredible launch To offer more insights we introduce Category in Arena and dive deeper into comparisons across domains including Coding Longer Query and Multilingual capabilities"
X Link 2024-04-11T22:48Z 60.1K followers, 655.6K engagements

"Wow nearly 3K votes overnight -- A huge shoutout to our amazing community Confidence intervals are narrowing and Llama-3 remains strong Big congrats to @AIatMeta for this incredible launch & contribution to open community. Full result coming out soon. Early 1K votes are in and Llama-3 is on FIRE🔥The New king of OSS model Vote now and make your voice heard Leaderboard update coming very soon. https://t.co/L9h9QrCkjl Early 1K votes are in and Llama-3 is on FIRE🔥The New king of OSS model Vote now and make your voice heard Leaderboard update coming very soon. https://t.co/L9h9QrCkjl"
X Link 2024-04-19T16:45Z 36.9K followers, 49.4K engagements

"Exciting update -- Llama-3 full result is out now reaching top-5 on the Arena leaderboard🔥 We've got stable enough CIs with over 12K votes. No question now Llama-3 70B is the new king of open model. Its powerful 8B variant has also surpassed many larger-size models. What an incredible launch Huge congrats to Llama team at @AIatMeta and for such valuable contribution to open community Can't wait to see the 400B"
X Link 2024-04-22T18:56Z 94.5K followers, 1.1M engagements

"More exciting news today -- Gemini [---] Pro result is out Gemini [---] Pro API-0409-preview now achieves #2 on the leaderboard surpassing #3 GPT4-0125-preview to almost top-1 Gemini shows even stronger performance on longer prompts in which it ranks joint #1 with the latest GPT-4-Turbo🔥 Big congrats to @GoogleDeepMind on shipping this powerful Gemini API to developers & community. Very excited to see what app can be built on top Congrats @GoogleDeepMind on shipping Gemini [---] Pro to public review Upon capacity & latency testing we have now brought Gemini [---] Pro up to the Arena🤖 Big"
X Link 2024-04-23T02:17Z 42K followers, 478.3K engagements

"Congrats @Microsoft for the open release of Phi-3 their next generation of fast and capable model We've collected 6K+ votes for Phi-3 and pushed a new leaderboard release. The model is definitely showing great potentials of its size. Excited to see more community fine-tunes More news: - Arena got 800K votes🗳🔥 - Snowflake Arc Instruct just joined Arena phi-3 is here and it's . good :-). I made a quick short demo to give you a feel of what phi-3-mini (3.8B) can do. Stay tuned for the open weights release and more announcements tomorrow morning (And ofc this wouldn't be complete without the"
X Link 2024-04-26T20:40Z 39.8K followers, 125.3K engagements

"Thanks for the incredible enthusiasm from our community We really didn't see this coming. Just a couple of things to clear up: - In line with our policy we've worked with several model developers in the past to offer community access to unreleased models/checkpoints (e.g. mistral-next gpt2-chatbot) for preview testing - We don't put these private checkpoints on our leaderboard until they become fully public - Due to unexpectedly high traffic & capacity limit we have to temporarily take gpt2-chatbot offline. Please stay-tuned for its broader releases :) We're all about bringing more amazing"
X Link 2024-04-30T19:44Z 40.3K followers, 209.4K engagements

"Exciting news -- we're thrilled to announce that LMSYS + @kaggle are launching a human preference prediction competition with $100000 in prizes Your challenge is to predict which responses users will prefer in head-to-head battles between LLMs in the Chatbot Arena real-world data. Our newly released dataset for competition contains 55000+ user/LLM conversations and preferences with over [--] cutting-edge LLMs such as GPT-4 Claude [--] Llama [--] and Mistral models. Join the competition to win the prize"
X Link 2024-05-02T18:29Z 41.2K followers, 84.2K engagements

"Exciting update - the latest leaderboard result is here We have collected fresh 30K votes for three new strong models from @RekaAILabs Reka Core and @Alibaba_Qwen Qwen Max/110B Reka Core has climbed to the 7th spot matching Claude-3 Sonnet and almost Llama-3 70B Meanwhile Qwen Max / 110B also achieves top-10/13. Big congrats Reka and Qwen for the great achievements"
X Link 2024-05-08T22:07Z 76.4K followers, 154.9K engagements

"To conclude Llama [--] has reached performance on par with top-tier proprietary models in overall use cases. Congrats again to the Llama team @AIatMeta for such a valuable contribution to the community Moving forward we expect to push new categories to the leaderboard soon based on the above analysis. More insights to come on all Arena models"
X Link 2024-05-09T00:19Z 41.1K followers, 11.1K engagements

"Breaking news gpt2-chatbots result is now out gpt2-chatbots have just surged to the top surpassing all the models by a significant gap (50 Elo). It has become the strongest model ever in the Arena With improvement across all boards especially reasoning & coding capabilities we're excited to see what app can build on top. Huge congrats to @OpenAI for this incredible milestone Note: this is an internal screenshot. Its public version "gpt-4o" is now in Arena and will soon appear on the public leaderboard"
X Link 2024-05-13T19:11Z 43.7K followers, 342.3K engagements

"In more challenging Coding Arena we see even bigger gap (100 Elo)"
X Link 2024-05-13T19:11Z 42K followers, 15.2K engagements

"@redjojovic @teortaxesTex @01AI_Yi if someone can host an endpoint for @deepseek_ai we will bring it up"
X Link 2024-05-20T22:31Z 42K followers, [---] engagements

"Update Evals don't stop on Friday night We're very excited to introduce the new Gemini-1.5 Family (Flash Pro and Gemini Advanced) to Arena with improved capabilities across the board. Come and challenge them with your toughest prompts :) leaderboard will be updated soon http://chat.lmsys.org http://chat.lmsys.org Gemini [---] Model Family: Technical Report updates now published In the report we present the latest models of the Gemini family Gemini [---] Pro and Gemini [---] Flash two highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information"
X Link 2024-05-25T04:14Z 42.8K followers, 55.7K engagements

"Big news Gemini [---] Flash Pro and Advanced results are out🔥 - Gemini [---] Pro/Advanced at #2 closing in on GPT-4o - Gemini [---] Flash at #9 outperforming Llama-3-70b and nearly reaching GPT-4-0125 () Pro is significantly stronger than its April version. Flashs cost capabilities and unmatched context length make it a market game-changer Huge congrats to @GoogleDeepMind on the incredible Gemini launches Can't wait to see what new applications Gemini unlocks More breakdown analysis below👇"
X Link 2024-05-28T17:47Z 42.8K followers, 286.8K engagements

"Update Arena welcomes new players - Phi-3-Medium/Small @Microsoft - Qwen2-72B-Instruct @Alibaba_Qwen - Yi-1.5-34B-Chat @01AI_Yi Challenge these amazing state-of-the-art open models with your toughest prompt Stay tuned for leaderboard"
X Link 2024-05-31T18:15Z 56.8K followers, 60K engagements

"@Microsoft @Alibaba_Qwen @01AI_Yi Chat & Vote at http://chat.lmsys.org http://chat.lmsys.org"
X Link 2024-05-31T18:16Z 43.7K followers, [----] engagements

"Chatbot Arena update We're thrilled to welcome Qwen2-72B to our leaderboard now the best open model in Chinese (#7) Impressive performance across the board: - Significant improvement from v1.5-110B - Matching GPT-4-0314 in Overall - Nearly catching up to the best open model Llama-3-70B in "Hard Prompts" Congrats to @Alibaba_Qwen on this incredible milestone A huge contribution to the open community. More plots below👇 💗Hello Qwen2 Happy to share the Qwen2 models to you all 📖 BLOG: https://t.co/0UNwRo1Iea 🤗 HF collection: https://t.co/z6oWkw7Kzb 🤖 https://t.co/Bp56AqQpQJ 💻 GitHub:"
X Link 2024-06-07T11:03Z 46K followers, 79K engagements

"Congrats @nvidia on the exciting 340B model release The model was tested under the codename "june-chatbot" and is now coming out of stealth with impressive performance surpassing Llama-3-70b across hard benchmarks like Arena-Hard-Auto. The new best open model Come play with Nemotron-4-340B and vote yourself. Leaderboard update coming very soon Today we are happy to release best open models for synthetic data generation. 340B parameters includes base instruct and reward models. As well as new human preference dataset HelpSteer2. 340B-Reward model is #1 on the Reward Bench leaderboard."
X Link 2024-06-14T18:27Z 45.1K followers, 109.9K engagements

"Chatbot Arena update @NVIDIAAI's Nemotron-4-340B has just edged past Llama-3-70B to become the new best open model on Arena leaderboard Key highlights: - Impressive performance in longer queries - Balanced multilingual capabilities - Robust performance in "Hard Prompts" Congrats @NVIDIAAI for this remarkable milestrone & contribution to the open community Check out more plots below👇"
X Link 2024-06-17T22:50Z 46K followers, 92.6K engagements

"Exciting news - Chatbot Arena now supports image uploads📸 Challenge GPT-4o Gemini Claude and LLaVA with your toughest questions. Plot to code VQA story telling you name it. Let's get creative and have fun Leaderboard coming soon. Credits to builders @chrischou03 @lisabdunlap @ying11231 @lm_zheng @infwinston @ml_angelopoulos & advisors Ion Stoica @profjoeyg @trevordarrell Find link below👇"
X Link 2024-06-19T15:47Z 47.4K followers, 173.4K engagements

"Chatbot Arena update @deepseek_ai DeepSeek-Coder-v2 has climbed to #4 in Coding Arena nearing GPT-4-Turbo levels Its now the top open model for coding. It's also strong in Hard Prompts ranking #11 but sits at #20 in Overall generic questions. Meanwhile @ChatGLM GLM-0520 from Zhipu AI/Tsinghua impresses at #9 in Coding and #11 Overall. Chinese LLMs are getting more competitive than ever Our next leaderboard update on Sonnet [---] coming soon :) DeepSeek-Coder-V2: First Open Source Model Beats GPT4-Turbo in Coding and Math Excels in coding and math beating GPT4-Turbo Claude3-Opus Gemini-1.5Pro"
X Link 2024-06-23T19:57Z 46K followers, 61.8K engagements

"🔥Breaking News from Chatbot Arena @AnthropicAI Claude [---] Sonnet has just made a huge leap securing the #1 spot in Coding Arena Hard Prompts Arena and #2 in the Overall leaderboard. New Sonnet has surpassed Opus at 5x the lower cost and competitive with frontier models GPT-4o/Gemini [---] Pro across the boards. Huge congrats to @AnthropicAI for this incredible milestone Can't wait to see the new Opus & Haiku. Our new vision leaderboard is also coming soon (More analysis below)"
X Link 2024-06-24T19:59Z 46.1K followers, 176.4K engagements

"Model Strength CIs in Coding & Overall Arena. Find full data at http://learderboard.lmsys.org http://learderboard.lmsys.org"
X Link 2024-06-24T19:59Z 45.3K followers, 13.5K engagements

"Congrats @GoogleDeepMind on the new Gemma-2 27B & 9B release Gemma-2 was tested in the Arena under the codename "*late-june-chatbots" and now out of stealth. Its early result matches the best open models (Llama-3-70B Nemotron-340B) with only 27B parameters Impressively Gemma-2-9B is ranked as high as Qwen-2-72B. The rate of improvement is fast. Great news to the open community More links below. Gemma [--] is out As with our first model we're super focused on creating models at useful practical sizes so that they can be easily deployable. all the while being amazing in quality. We upgraded our 9B"
X Link 2024-06-27T16:49Z 46.2K followers, 79.8K engagements

"Early result of Gemma [--] on the leaderboard matching Llama-3-70B. - Full data at - Chat with Gemma [--] at - Gemma [--] blog http://goo.gle/3RLQXUa http://chat.lmsys.org http://leaderboard.lmsys.org http://goo.gle/3RLQXUa http://chat.lmsys.org http://leaderboard.lmsys.org"
X Link 2024-06-27T16:49Z 45.6K followers, 12.8K engagements

"🔥Exciting News we are thrilled to announce Chatbot Arenas Vision Leaderboard Over the past [--] weeks weve collected 17K+ votes across diverse use cases. Highlights: - GPT-4o leads the way followed by Claude [---] Sonnet in #2 and Gemini [---] Pro in #3 - Open model LLaVA-v1.6-34B achieves comparable score with Claude-3-Haiku - Notable gap among vision-language models compared to text-only scenario Congrats to all for these remarkable achievements This is just the beginning. Let's bring more tough tasks to challenge model's limits (blog post link and more plots below👇) Exciting news - Chatbot"
X Link 2024-06-28T13:59Z 45.8K followers, 78.9K engagements

"Multi-turn conversations with LLMs are crucial for many applications today. Were excited to introduce a new category "Multi-Turn" which includes conversations with =2 turns to measure models' abilities to handle longer interactions. Key findings: - 14% Arena votes are multi-turn - Claude models' scores increased significantly. Claude [---] Sonnet becomes joint #1 with GPT-4o. - Gemma-2-27B and Llama-3-70B are the best open models now joint #10. Let us know your thoughts"
X Link 2024-06-30T02:34Z 46.1K followers, 66.4K engagements

"Not all questions need GPT-4 We introduce RouteLLM a routing framework based on human preference data that directs simple queries to a cheaper model. With data augmentation techniques RouteLLM achieves cost reductions of over 85% on MT Bench and 45% on MMLU while maintaining 95% GPT-4 performance. Comparing against commercial offerings (Martian and Unify AI) on MT Bench and achieve the same performance while being over 40% cheaper. Our model datasets and code for serving and evaluating LLM routers are all open-sourced. We are excited to see what the community will build on top 1/6 (blog post"
X Link 2024-07-01T16:25Z 47.2K followers, 276.9K engagements

"We also build a little demo. For example coding question was routed to GPT-4-1106 and blog writing to Mixtral-8x7b. Link: 5/6 https://0c83f754b05f4a2208.gradio.live/ https://0c83f754b05f4a2208.gradio.live/"
X Link 2024-07-01T16:25Z 45.9K followers, [----] engagements

"Credits to our amazing team & collab with @anyscalecompute: @isaacongjw @AmjadMahayri @Cveinnt @infwinston @WthThao @profjoeyg @waleedk Ion Stoica Links: - Blog post: - Framework: - Models and datasets: - Benchmark details: - Paper: 6/6 https://arxiv.org/abs/2406.18665 https://github.com/lm-sys/RouteLLM/tree/main/benchmarks https://huggingface.co/routellm https://github.com/lm-sys/RouteLLM http://lmsys.org/blog/2024-07-01-routellm/ https://github.com/lm-sys/RouteLLM http://lmsys.org/blog/2024-07-01-routellm/ https://arxiv.org/abs/2406.18665"
X Link 2024-07-01T16:25Z 45.8K followers, [----] engagements

"We need the world's fastest typer to join Arena. We need a human baseline on lm-sys We need a human baseline on lm-sys"
X Link 2024-07-04T19:27Z 46K followers, 13.6K engagements

"Instruction-Following Arena - Claude-3.5/GPT-4o joint #1 (in CIs) - Gemma-2-27B #1 Open Model - Early GPT-4/Claudes all UP"
X Link 2024-07-09T20:31Z 46.5K followers, [----] engagements

"Math Arena (Pt 2) Ranking shifts quite a lot: - Mistral-8x22b UP - Gemma2-9b Llama3-8b Command-r drop - Phi-3 series UP"
X Link 2024-07-09T20:31Z 46.5K followers, [----] engagements

"@jeremyphoward @Drachs1978 That might be because youre doing tasks like coding or math. Claude is top in the arena for these tasks"
X Link 2024-07-15T01:10Z 46.1K followers, [---] engagements

"Congrats @openai on the new GPT-4o mini release GPT-4o mini's early version "upcoming-gpt-mini" was tested in Arena in the past week. With over 6K user votes we are excited to share its early score reaching GPT-4-Turbo performance while offering significant cost reduction (15/60 cent per million input/output tokens). Its official version "gpt-4o-mini" is now in the Arena. We're re-collecting votes & will update its result to leaderboard soon Were continuing to make advanced AI accessible to all with the launch of GPT-4o mini now available in the API and rolling out in ChatGPT today. Were"
X Link 2024-07-18T18:07Z 47.3K followers, 130.9K engagements

"In Math Arena GPT-4o mini drops a little but still gpt-4-turbo performance"
X Link 2024-07-23T21:02Z 47.6K followers, 16.6K engagements

"@redjojovic deploying it now :)"
X Link 2024-07-23T21:32Z 47.4K followers, [----] engagements

"People have been asking why GPT-4o mini ranks so high on Arena We truly appreciate all the feedback. A few things to note: [--]. Chatbot Arena measures human preference in different areas. We encourage everyone to not just look at the overall leaderboard but also per-category breakdown (e.g. Math Coding etc). [--]. Arena eval is LIVE as we tweet. We encourage everyone to compare models firsthand in Arena and test your hypotheses in real-time ;) [--]. Transparency is our core value; all our code and analysis are open source and we periodically release 20% of the data; we keep the rest of data private"
X Link 2024-07-24T07:18Z 47.6K followers, 189.7K engagements

"We are thrilled to announce the milestone release of SGLang Runtime v0.2 featuring significant inference optimizations after months of hard work. It achieves up to 2.1x higher throughput compared to TRT-LLM and up to 3.8x higher throughput compared to vLLM. It consistently delivers superior performance when serving Llama-8B to 405B models on A100/H100 with FP8/BF16. SGLang is fully open-source and implemented in Python. As it matures from a prototype we invite the community to join us in creating the next-generation efficient serving engine Learn more at"
X Link 2024-07-25T16:45Z 47.6K followers, 94.2K engagements

"Exciting news @metaai's Llama-3.1 results are here🔥 The Llama-3.1 series extensively tested over the past week has gathered over 10K community votes. Now Llama-3.1-405B has climbed to #3 on the Overall Arena leaderboard marking the first time an open model has ranked in the top [--] Huge congrats to @metaai for this incredible launch & milestone in AI We also thank @anyscalecompute for hosting the Llama-3.1 models for us. More category breakdown below👇"
X Link 2024-07-30T16:24Z 58.9K followers, 176.9K engagements

"@FZaslavskiy @amebagpt @metaai We follow the configuration recommended by Meta"
X Link 2024-07-30T19:36Z 48.1K followers, [---] engagements

"Congrats This year will be an exciting time for the open-source AI community. We are thrilled to announce a $7M raise to become the leading Open-Access AI Cloud 🤘🏼🌪 At Hyperbolic were building an open AI ecosystem and economy where everyone who contributes is rewarded. Our goal is not to merely optimize AI performance to compete with traditional Web2 https://t.co/pXTvmK98yO We are thrilled to announce a $7M raise to become the leading Open-Access AI Cloud 🤘🏼🌪 At Hyperbolic were building an open AI ecosystem and economy where everyone who contributes is rewarded. Our goal is not to"
X Link 2024-07-31T04:47Z 48.8K followers, 10.7K engagements

"Congrats @GoogleDeepMind on the Gemma-2-2B release Gemma-2-2B has been tested in the Arena under "guava-chatbot". With just 2B parameters it achieves an impressive score [----] on par with models 10x its size (For reference: GPT-3.5-Turbo-0613: [----] Mixtral-8x7b: 1114). This model is undoubtedly the best for on-device applications. Excited to see the new possibilities Gemma-2-2B will unlock Were welcoming a new [--] billion parameter model to the Gemma [--] family. 🛠 It offers best-in-class performance for its size and can run efficiently on a wide range of hardware. Developers can get started with"
X Link 2024-07-31T17:07Z 61.2K followers, 237.6K engagements

"Overall win-rate heatmap: Gemini [---] Pro (0801) wins 54% vs GPT-4o 59% vs Claude-3.5-Sonnet. Check out full data at and come chat with the model http://leaderboard.lmsys.org http://leaderboard.lmsys.org"
X Link 2024-08-01T16:33Z 51.6K followers, 36.3K engagements

"The new Mistral Large [--] is here🔥 It's now leading the Arena hard leaderboards. Super impressive performance in Coding Hard Prompts and Math surpassing top-tier models GPT-4 Turbo/Claude Opus Plus it's an open-weight model -- another great news for the open-source community. Huge Congrats @MistralAI Mistral Large [--] (2407) is now on @lmsysorg. It performs extremely well in the Coding Hard Prompts Math and Longer Query categories where it outperforms GPT4-Turbo and Claude [--] Opus. It is also doing very well in Instruction Following where it ranks above Llama [---] 405B. https://t.co/XCbgdzqEw7"
X Link 2024-08-06T15:50Z 51.8K followers, 79.4K engagements

"Exciting Update from Chatbot Arena The latest @OpenAI ChatGPT-4o (20240808) API has been tested under "anonymous-chatbot" for the past week with over [-----] community votes. OpenAI has now successfully re-claimed the #1 position surpassing Google's Gemini-1.5-Pro-Exp with an impressive score of [----] New ChatGPT-4o demonstrates notable improvement in technical domains particularly in Coding (30+ point over GPT-4o-20240513) as well as in Instruction-following and Hard Prompts. Huge congrats to @OpenAI on this remarkable achievement New ChatGPT-4o Category Rankings: - Overall: #1 - Math: #1-2 -"
X Link 2024-08-14T00:21Z 66K followers, 785.4K engagements

"Overall win-rate heatmap. Grok [--] official blog at We will test the official version and update it to leaderboard soon http://x.ai/blog/grok-2 http://x.ai/blog/grok-2"
X Link 2024-08-14T05:57Z 57.1K followers, 22K engagements

"Real-time battles between frontier AIs competition has never been fiercer Updated video showing the battle for the best LLM by the Top [--] companies based on the @lmsysorg Chatbot arena Elo scores https://t.co/f1dFgU4fUf Updated video showing the battle for the best LLM by the Top [--] companies based on the @lmsysorg Chatbot arena Elo scores https://t.co/f1dFgU4fUf"
X Link 2024-08-20T17:29Z 57.1K followers, 14.3K engagements

"New competitors have entered the Arena @Microsoft's Phi-3-Vision (4.2B) and @OpenBMB's MiniCPM-V-2.6 (8B) two lightweight SoTA vision-language models have been added. Will the gap between open and closed models in the vision leaderboard continue to narrow Jump in and test your toughest image reasoning challenges"
X Link 2024-08-20T20:53Z 57.1K followers, 28.1K engagements

"Were rolling out a new "Overview" feature for the leaderboard. @xAI's Grok-2 stands out ranking at the top across all categoriesMath Hard Prompts Coding and Instruction-following"
X Link 2024-08-23T17:52Z 56.4K followers, 76.6K engagements

"Full leaderboard view at http://lmarena.ai/leaderboard http://lmarena.ai/leaderboard"
X Link 2024-08-23T17:52Z 57.1K followers, 17.7K engagements

"Learn more about Grok-2 at https://x.ai/blog/grok-2 https://x.ai/blog/grok-2"
X Link 2024-08-23T17:52Z 57.1K followers, 12.3K engagements

"Chatbot Arena update⚡ The latest Gemini (Pro/Flash/Flash-9b) results are now live with over 20K community votes Highlights: - New Gemini-1.5-Flash (0827) makes a huge leap climbing from #23 to #6 overall - New Gemini-1.5-Pro (0827) shows strong gains in coding math over previous versions. - The new smaller Gemini-1.5 Flash-8b outperforms gemma-2-9b matching llama-3-70b levels. Big Congrats @GoogleDeepMind Gemini team on the incredible launch More plots in the followup posts👇 **Note: to better reflect community interests older models nearing deprecation will soon be removed from the default"
X Link 2024-08-27T18:56Z 57.1K followers, 465.5K engagements

"Check out the new SGLang release featuring performance boosts with DeepSeek MLA optimization and torch.compile We're also introducing multi-image/video inputs with LLaVA-OneVision support now live in the Chatbot Arena served by the latest SGLang. Try the Code: LLaVA-OneVision demo at http://lmarena.ai https://github.com/sgl-project/sglang We're excited to announce the release of SGLang v0.3 featuring enhanced performance and extended support for novel architectures Highlights include: - Up to 7x higher throughput for DeepSeek Multi-Head Latent Attention (MLA) - Up to 1.5x lower latency with"
X Link 2024-09-04T19:49Z 57K followers, 22K engagements

"We're launching something new. Sign up now to become a beta tester and get early access to Copilot Arena 👨💻🤖 https://forms.gle/o8Qh7SccrVEkuXnX6 https://forms.gle/o8Qh7SccrVEkuXnX6"
X Link 2024-09-05T16:38Z 57.1K followers, 33.8K engagements

"⚠WARNING: offensive content ahead. Introducing RedTeam Arena with Bad Wordsour first game. You've got [--] seconds to break the model to say the bad word. The faster the better. (Collaboration with @elder_plinus and the awesome BASI 🐍 community.) Link to the site below👇 Deuxime Entre 🥁 INTRODUCING: https://t.co/kXtGhZNnyF 🏟 CALLING ALL AI HACKERS LIBERATORS AND EXPLORERS This is not a bug bounty program. This is not your grandma's jailbreak arena. This is a fleet of misfit pirates exploring the latent space in the name of Deuxime Entre 🥁 INTRODUCING: https://t.co/kXtGhZNnyF 🏟 CALLING ALL"
X Link 2024-09-06T23:36Z 57K followers, 185.2K engagements

"Chatbot Arena update We added a new "Style Control" button to the leaderboard Now you can apply it to Overall and Hard Prompts to see how rankings shift. We're dedicated to continually improving the leaderboard please share your feedback Learn more details in our blog👇 Does style matter over substance in Arena Can models "game" human preference through lengthy and well-formatted responses Today we're launching style control in our regression model for Chatbot Arena our first step in separating the impact of style from substance in https://t.co/EYF762uAGF Does style matter over substance in"
X Link 2024-09-10T19:04Z 57.1K followers, 43.3K engagements

"Congrats @OpenAI on the exciting o1 release o1-preview and o1-mini are now live in Chatbot Arena accepting votes. Come challenge them with your toughest math/reasoning prompts We're releasing a preview of OpenAI o1a new series of AI models designed to spend more time thinking before they respond. These models can reason through complex tasks and solve harder problems than previous models in science coding and math. https://t.co/peKzzKX1bu We're releasing a preview of OpenAI o1a new series of AI models designed to spend more time thinking before they respond. These models can reason through"
X Link 2024-09-13T01:02Z 57.1K followers, 292.7K engagements

"@OpenAI Chat with o1 at Leaderboard coming soon http://lmarena.ai http://lmarena.ai"
X Link 2024-09-13T01:02Z 57.1K followers, [----] engagements

"Chatbot Arena update🔥 We've been testing the latest ChatGPT-4o (20240903) over the past [--] weeks and the results show significant improvements across the board: - Overall: [----] - [----] - Overall (style control): [----] - [----] - Hard Prompts: [----] - [----] - Multi-turn: [----] - [----] Congrats @OpenAI for pushing the boundaries yet again Next o1 results will be updated very soon. Stay tuned. Note: its older version (20240808) has been deprecated from the default leaderboard view. Our latest GPT-4o released September [--] [----] has been tested in @lmsysorg arena. This model accessible to all users for the"
X Link 2024-09-16T23:36Z 57.1K followers, 106.3K engagements

"Full result at http://lmarena.ai/leaderboard http://lmarena.ai/leaderboard"
X Link 2024-09-16T23:36Z 57.1K followers, [----] engagements

"The model is available via OpenAI API as "chatgpt-4o-latest""
X Link 2024-09-16T23:43Z 57K followers, [----] engagements

"No more waiting. o1's is officially on Chatbot Arena We tested o1-preview and mini with 6K+ community votes. 🥇o1-preview: #1 across the board especially in Math Hard Prompts and Coding. A huge leap in technical performance 🥈o1-mini: #1 in technical areas #2 overall. Huge congrats to @OpenAI on this incredible milestone Come try the king of LLMs and vote at More analysis below👇 http://lmarena.ai Congrats @OpenAI on the exciting o1 release o1-preview and o1-mini are now live in Chatbot Arena accepting votes. Come challenge them with your toughest math/reasoning prompts"
X Link 2024-09-18T16:32Z 59.3K followers, 489.7K engagements

"Chatbot Arena Leaderboard overview. @openai's o1-preview #1 across the board and o1-mini #1 in technical areas"
X Link 2024-09-18T16:32Z 59.3K followers, 11.7K engagements

"We are happy to announce a new site for Chatbot Arena Over the past year with the incredible support of our community Chatbot Arena has evolved into a mature ecosystem and platform. We believe it's time for it to graduate and stand on its own. By giving Chatbot Arena its own platform we aim to provide it with more independence and ensure its long-term growth. With a strong partnership with LMSys we're expanding the platform to evaluate frontier models not only for chatbots but also in areas like coding complex tasks and red-teaming. LMSys has been a research collective dedicated to a variety"
X Link 2024-09-20T20:51Z 57.1K followers, 75.5K engagements

"Exciting update from Vision Chatbot Arena Weve gathered over 6K new votes for the latest open vision models (Qwen2 Llama [---] Pixtral) and ChatGPT-4o. - ChatGPT-4o has taken the #1 spot surpassing Gemini. - Open models (Qwen Llama [---] Pixtral) are rapidly improving matching proprietary offerings Competition in vision is heating up. Cast your vote and help decide the best vision model Full leaderboard link below👇"
X Link 2024-09-27T18:37Z 56.9K followers, 145.2K engagements

"As part of Chatbot Arena's graduation🎓 we're excited to announce that we changed our X handle to @lmarena_ai For open-source systems & research at LMSys please follow @lmsysorg. This account @lmarena_ai will be dedicated to sharing Arena projects & leaderboard updates. See you tomorrow for another one 👀 We are happy to announce a new site for Chatbot Arena Over the past year with the incredible support of our community Chatbot Arena has evolved into a mature ecosystem and platform. We believe it's time for it to graduate and stand on its own. By giving Chatbot Arena its We are happy to"
X Link 2024-10-06T17:38Z 77.1K followers, 190.6K engagements

"Big News from Chatbot Arena @01AI_YI's latest model Yi-Lightning has been extensively tested in Arena collecting over 13K community votes Yi-Lightning has climbed to #6 in the Overall rankings (#9 in Style Control) matching top models like Grok-2. It delivers robust performance in technical areas like Math Hard Prompts and Coding. Huge congrats to @01AI_YI Meanwhile GLM-4-Plus by Zhipu AI (@ChatGLM) has also entered the top [--] marking a strong surge for Chinese LLMs. They're quickly becoming highly competitive. Stay tuned for more More analysis below👇 Yi-Lightning is now in Chatbot Arena The"
X Link 2024-10-15T17:43Z 61K followers, 163.2K engagements

"Yi-Lightning Category Rankings: - Overall: #6 (Style Control #9) - Math: #3 - Coding: #4 - Hard Prompts: #4 GLM-4-Plus - Overall: #9 (Style Control #15) - Math: #8 - Coding: #4 - Hard Prompts: #6"
X Link 2024-10-15T17:43Z 57K followers, [----] engagements

"Check out full result at http://lmarena.ai/leaderboard http://lmarena.ai/leaderboard"
X Link 2024-10-15T17:51Z 56.8K followers, [----] engagements

"Congrats @nvidia on the impressive Nemotron release Llama-3.1-Nemotron-70B trained with REINFORCE just set a new record in our Arena-Hard benchmark matching top models like GPT-4o Now it comes the real testNemotron is live in Chatbot Arena for human evaluation. Come ask tough prompts at lmarena. Stay tuned for leaderboard updates Our Llama-3.1-Nemotron-70B-Instruct model is a leading model on the 🏆 Arena Hard benchmark (85) from @lmarena_ai. Arena Hard uses a data pipeline to build high-quality benchmarks from live data in Chatbot Arena and is known for its predictive ability of Chatbot"
X Link 2024-10-17T18:28Z 57K followers, 34.7K engagements

"No more waiting @allen_ai's Molmo-72B vision model is live in Chatbot Arena Molmo outperforms GPT-4o in academic vision benchmarks and matches it in human ratingsa big milestone for open vision-language models. Test it with tough image questions at lmarena. ai. Leaderboard updates soon Meet Molmo: a family of open state-of-the-art multimodal AI models. Our best model outperforms proprietary systems using 1000x less data. Molmo doesn't just understand multimodal datait acts on it enabling rich interactions in both the physical and virtual worlds. Try it https://t.co/kS4W1wYDPx Meet Molmo: a"
X Link 2024-10-17T18:50Z 57K followers, 30.3K engagements

"🔥New benchmark: Preference Proxy Evaluations (PPE) Can reward models guide RLHF Can LLM judge replace real human evals PPE addresses these questions Highlights: - Real-world human preference from Chatbot Arena💬 - 16000+ prompts and 32000+ diverse model responses🗿 - Includes verifiable correctness preferences at scale 💯 - Explicitly correlated to downstream RLHF outcomes 📈 Initial Takeaways: - Ensemble-LLM-Judge performs better - Open RMs have room for improvement See more in the thread"
X Link 2024-10-22T17:30Z 57.1K followers, 22K engagements

"Llama-3.1-Nemotron-70b by @nvidia is now on the Arena leaderboard with overall rank #9 and #26 with Style Control Impressive to see a 70B open model competitive in human preference as well as interesting ranking shifts under style control. Comment below to share your thoughts Llama-3.1-Nemotron-70B-Instruct model aligned by our team is now live on https://t.co/cPhNJ7C8T4 leaderboard with overall rank [--]. Everything used to create this model is public: code data and reward model. HF checkpoint: https://t.co/QhqlGPNtWB https://t.co/vOhJ8F9okP Llama-3.1-Nemotron-70B-Instruct model aligned by our"
X Link 2024-10-24T19:17Z 94.6K followers, 28.5K engagements

"Chatbot Arena Update🔥 @AnthropicAI latest Claude [---] Sonnet has been extensively tested in Arena securing an impressive #6 overall and #3 under style-control With over 7K community votes the new Sonnet is showing exceptional strength across various domains. Highlights: - Hard Prompts: #4 (#1 with SC) - Coding: #2 (#1 with SC) - Math: #3 Congrats to @AnthropicAI on the impressive new release More analysis below👇 Introducing an upgraded Claude [---] Sonnet and a new model Claude [---] Haiku. Were also introducing a new capability in beta: computer use. Developers can now direct Claude to use"
X Link 2024-10-28T18:57Z 94.6K followers, 108.2K engagements

"🚨New Chatbot Arena Category: Creative Writing Arena Creative writing (15% votes) involves originality artistic expression and often different from technical prompts. Key Findings: - o1-Mini drops below top models - Gemini [---] Pro/Flash [---] both UP significantly - ChatGPT-4o-Latest remains #1 with significant jump - New Sonnet [---] improved over previous version More analysis below👇"
X Link 2024-10-30T19:57Z 94.6K followers, 74.2K engagements

"Which model is best for coding @CopilotArena leaderboard is out Our code completions leaderboard contains data collected over the last month with 100K completions served and 10K votes Lets discuss our findings so far🧵"
X Link 2024-11-12T21:08Z 58.9K followers, 97.9K engagements

"Here are our main takeaways from the leaderboard: - With our prompting method Sonnet-3.5 is able to compete with code-specific models like Deepseek V2.5 on code completion. - Within a tier we still observe slight fluctuations as we obtain more votes. - We find that GPT-4o-mini is much worse than all other models. 2/n"
X Link 2024-11-12T21:08Z 73.1K followers, [----] engagements

"Most current Copilot Arena users code in Python followed by javascript/typescript html/markdown and C++. 3/n"
X Link 2024-11-12T21:08Z 73.1K followers, 10.9K engagements

"Massive News from Chatbot Arena🔥 @GoogleDeepMind's latest Gemini (Exp 1114) tested with 6K+ community votes over the past week now ranks joint #1 overall with an impressive 40+ score leap matching 4o-latest in and surpassing o1-preview It also claims #1 on Vision leaderboard. Gemini-Exp-1114 excels across technical and creative domains: - Overall #3 - #1 - Math: #3 - #1 - Hard Prompts: #4 - #1 - Creative Writing #2 - #1 - Vision: #2 - #1 - Coding: #5 - #3 - Overall (StyleCtrl): #4 - #4 Huge congrats to @GoogleDeepMind on this remarkable milestone Come try the new Gemini and share your"
X Link 2024-11-14T17:17Z 60.5K followers, 719.9K engagements

"Congrats @NexusflowX on the latest Athene-V2-72B release matching top models across hard benchmarks Now it comes the real testAthene is live in Arena for human evaluation. Come ask tough prompts at lmarena. ai Introducing Athene-V2: An Open Model Suite Comparable to GPT-4o across Benchmarks With the scaling law plateauing the future of LLMs lies in targeted post-training to enhance and tailor model capabilities. Athene-V2 is designed to address this shift offering two specialized https://t.co/yGQubN4F9X Introducing Athene-V2: An Open Model Suite Comparable to GPT-4o across Benchmarks With the"
X Link 2024-11-15T18:19Z 59.2K followers, 54.1K engagements

"Exciting News from Chatbot Arena❤🔥 Over the past week the latest @OpenAI ChatGPT-4o (20241120) competed anonymously as "anonymous-chatbot" gathering 8000+ community votes. The result OpenAI reclaims the #1 spot surpassing Gemini-Exp-1114 with an impressive [----] score Latest GPT-4o shows remarkable improvements we see a leap in creative writing (1365 1402) as well as technical domains (e.g. coding math). Category Rankings: - Overall: #2 #1 - Overall (StyleCtrl): #2 #1 - Creative Writing: #2 #1 - Coding: #2 #1 - Math: #4 #3 - Hard: #2 #1 Huge congrats @OpenAI More analysis below👇 GPT-4o got"
X Link 2024-11-20T18:49Z 60.3K followers, 418.1K engagements

"@koltregaskes @OpenAI oops that was a typo. we've fixed it on the leaderboard. http://lmarena.ai/leaderboard http://lmarena.ai/leaderboard"
X Link 2024-11-20T18:59Z 58.9K followers, [----] engagements

"Typo in chatgpt-4o-latest model name has been fixed. We've also deprecated its previous version on leaderboard but you can still chat with them all at http://lmarena.ai Exciting News from Chatbot Arena❤🔥 Over the past week the latest @OpenAI ChatGPT-4o (20241120) competed anonymously as "anonymous-chatbot" gathering 8000+ community votes. The result OpenAI reclaims the #1 spot surpassing Gemini-Exp-1114 with an impressive [----] score https://t.co/ogmhhCW7zY http://lmarena.ai Exciting News from Chatbot Arena❤🔥 Over the past week the latest @OpenAI ChatGPT-4o (20241120) competed anonymously as"
X Link 2024-11-20T19:30Z 58.9K followers, 34.2K engagements

"@iruletheworldmo the model joined arena 11/11 hence the date but at that time it's final release name was not known :)"
X Link 2024-11-20T19:36Z 58.7K followers, [---] engagements

"We're excited to announce the 2nd Chatbot Arena Challenge on Kaggle now featured in the WSDM Cup with $50K prize pool This time we're tackling on multilingual human preferences. Join the challenge today to win the prize Competition Launch WSDM Cup - Multilingual Chatbot Arena hosted by @lmarena_ai 🎯to predict which responses users will prefer in a head-to-head battle between chatbots powered by LLMs 💰 $50000 Prize Pool Entry Deadline: 1/27/25 This competition was selected for the Competition Launch WSDM Cup - Multilingual Chatbot Arena hosted by @lmarena_ai 🎯to predict which responses"
X Link 2024-11-20T20:38Z 59.4K followers, 32.4K engagements

"Woah huge news again from Chatbot Arena🔥 @GoogleDeepMinds just released Gemini (Exp 1121) is back stronger (+20 points) tied #1🏅Overall with the latest GPT-4o-1120 in Arena Ranking gains since Gemini-Exp-1114: - Overall #3 #1 - Overall (StyleCtrl): #5 - #2 - Hard Prompts (StyleCtrl): #3 #1 - Coding: #3 #1 - Vision: #1 - Math: #2 #1 - Creative Writing #2 #1 Congrats again @GoogleDeepMind The LLM race is on fire progress is now measured in days See more analysis below👇 Say hello to gemini-exp-1121 Our latest experimental gemini model with: - significant gains on coding performance - stronger"
X Link 2024-11-21T19:00Z 64K followers, 611.9K engagements

"Who's the best AI software engineer Introducing RepoChat Arena: the live AI software engineering battle🔥🤖 [--]. Input any public Github link (repo/issue/PR). [--]. Ask the models to fix issues add features or chat with a repo. [--]. Vote for the better one and shape the leaderboard Watch AI solve your real-world coding tasks live at RepoChat Exciting use cases in the thread below🧵"
X Link 2024-11-27T20:29Z 59.7K followers, 99.1K engagements

"Two weeks into our 2nd @Kaggle Challenge and the competition is heating up with 1200+ submissions Join us in tackling the multilingual Chatbot Arena and compete for a 50K prize pool Link below: We're excited to announce the 2nd Chatbot Arena Challenge on Kaggle now featured in the WSDM Cup with $50K prize pool This time we're tackling on multilingual human preferences. Join the challenge today to win the prize https://t.co/KFmKWp1FAQ We're excited to announce the 2nd Chatbot Arena Challenge on Kaggle now featured in the WSDM Cup with $50K prize pool This time we're tackling on multilingual"
X Link 2024-12-01T20:44Z 59.3K followers, 18.1K engagements

"Arena update: Pixtral Large has now overtaken Qwen-VL-72B to become the #1 open model in Vision Arena👀 Congrats @MistralAI on the remarkable open release. Check out the leaderboard to see the latest rankings We also released Pixtral Large a new SOTA vision model. https://t.co/cE7Drzasvv We also released Pixtral Large a new SOTA vision model. https://t.co/cE7Drzasvv"
X Link 2024-12-03T18:46Z 59.8K followers, 70K engagements

"@CopilotArenas server is now open-sourced🥳 Check out how we handle code completions and share your ideas for new system prompts Github: More links in thread🧵 https://github.com/lmarena/copilot-arena/tree/main/server https://github.com/lmarena/copilot-arena/tree/main/server"
X Link 2024-12-04T17:55Z 59.3K followers, [----] engagements

"New release - SGLang v0.4 🚀 Exciting News: SGLang v0.4 Release 🚀 SGLang v0.4 is here packed with major upgrades to power your LLM workflows Key highlights: 🔹 Zero-Overhead Batch Scheduler: 1.1x throughput boost keeping GPUs fully loaded. 🔹 Cache-Aware Load Balancer: Up to 1.9x throughput 3.8x https://t.co/MAe4s24uEi 🚀 Exciting News: SGLang v0.4 Release 🚀 SGLang v0.4 is here packed with major upgrades to power your LLM workflows Key highlights: 🔹 Zero-Overhead Batch Scheduler: 1.1x throughput boost keeping GPUs fully loaded. 🔹 Cache-Aware Load Balancer: Up to 1.9x throughput 3.8x"
X Link 2024-12-04T21:36Z 59.4K followers, [----] engagements

"Exciting New Launch: WebDev Arena🔥 Input any prompt and see which model builds the best website for you. Cool examples in thread🧵 Introducing WebDev Arena An arena where two LLMs compete to build a web app. You can vote on which LLM performs better and view a leaderboard of the best models. 100% Free and Open Source with @lmarena_ai. https://t.co/1LRsdG9eSi https://t.co/eR8DBjdolu Introducing WebDev Arena An arena where two LLMs compete to build a web app. You can vote on which LLM performs better and view a leaderboard of the best models. 100% Free and Open Source with @lmarena_ai."
X Link 2024-12-10T19:19Z 59.8K followers, 23.4K engagements

"Claude [---] vs GPT-4o building an Airbnb clone"
X Link 2024-12-10T19:19Z 60.1K followers, [----] engagements

"Claude [---] vs GPT-4o building a VS @code/@cursor_ai web app"
X Link 2024-12-10T19:19Z 59.6K followers, [---] engagements

"Claude [---] vs GPT-4o building an analytics dashboard"
X Link 2024-12-10T19:19Z 59.8K followers, [----] engagements

"Breaking News from Chatbot Arena⚡ @GoogleDeepMind Gemini-2.0-Flash debuts at #3 Overall - a massive leap from Flash-002 Highlights (improvement from Flash-002): - Overall: #11 #3 - Hard Prompts: #15 #2 - Coding: #22 #3 - Longer query: #8 #1 - Overall style-controlled: #19 #3 - Hard style-controlled: #25 #2 The pace of improvement is absolutely astounding Excited to see the new wave of applications powered by Flash. More analysis below👇 Were kicking off the start of our Gemini [---] era with Gemini [---] Flash which outperforms [---] Pro on key benchmarks at 2X speed (see chart below). Im"
X Link 2024-12-11T15:53Z 61K followers, 330.2K engagements

"Gemini-2.0-Flash is also top in Vision Arena"
X Link 2024-12-11T15:53Z 59.8K followers, [----] engagements

"In coding it's matching the best models (o1 gemini-exp-1206)"
X Link 2024-12-11T15:53Z 59.8K followers, [----] engagements

"@mattahmann @GoogleDeepMind @Alibaba_Qwen @deepseek_ai @MistralAI They are all in the Arena Leaderboard coming soon"
X Link 2024-12-11T16:09Z 59.7K followers, [---] engagements

"Thanks @tereza_tizkova at @e2b_dev featuring our latest launch WebDev Arena Check out the awesome video comparing LLMs on real Webdev tasks. Is Claude better than GPT or Gemini 👀 Which LLM is the best for creating full web apps WebDev Arena by @lmarena_ai just launched & you can test models for free. 🔸See comparison of web apps created by different LLMs 🔸 Vote for the best output 🔸Code from LLMs is executed https://t.co/FEnn44rYbz Is Claude better than GPT or Gemini 👀 Which LLM is the best for creating full web apps WebDev Arena by @lmarena_ai just launched & you can test models for"
X Link 2024-12-11T16:26Z 59.8K followers, [----] engagements

"Image Generation has landed🎨🤖 Introducing Image Arena - a battle of image generation models like FLUX Stable Diffusion Dall-E Recraft Ideogram and more Who will reign supreme [--]. Describe your desired image🎨 [--]. Two anonymous models output images [--]. Vote for the winner Enjoy We will be releasing the leaderboard soon More examples below👇"
X Link 2024-12-12T20:14Z 61.1K followers, 104.6K engagements

""Naruto from the anime""
X Link 2024-12-12T20:14Z 60.2K followers, [---] engagements

"We take inspiration from image battle interfaces like: ❤@fal's And our pre-set images are from https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1 https://artificialanalysis.ai/text-to-image/arena https://huggingface.co/spaces/TIGER-Lab/GenAI-Arena https://imgsys.org/ https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1 https://artificialanalysis.ai/text-to-image/arena https://huggingface.co/spaces/TIGER-Lab/GenAI-Arena https://imgsys.org/"
X Link 2024-12-12T20:14Z 60.3K followers, [----] engagements

"WebDev Arena Leaderboard is now live with 10K+ votes #1. Claude [---] Sonnet #2. Gemini-Exp-1206 #3. Gemini-2.0-Flash #4. GPT-4o-2024-11-20 #5. Qwen2.5-Coder-32B #6. Gemini-1.5-Pro-002 Congrats @AnthropicAI topping the leaderboard by a significant margin🔥 Leaderboard link and examples below👇 Introducing WebDev Arena An arena where two LLMs compete to build a web app. You can vote on which LLM performs better and view a leaderboard of the best models. 100% Free and Open Source with @lmarena_ai. https://t.co/1LRsdG9eSi https://t.co/eR8DBjdolu Introducing WebDev Arena An arena where two LLMs"
X Link 2024-12-13T20:03Z 61.1K followers, 189.2K engagements

"Breaking news from Chatbot Arena⚡🤔 @GoogleDeepMind's Gemini-2.0-Flash-Thinking debuts as #1 across ALL categories The leap from Gemini-2.0-Flash: - Overall: #3 #1 - Overall (Style Control): #4 #1 - Math: #2 #1 - Creative Writing: #2 #1 - Hard Prompts: #1 #1 (+14 pts) - Vision: #1 #1 (+16 pts) A new paradigm in reasoning models is here More breakdown below 👇 Introducing Gemini [---] Flash Thinking an experimental model that explicitly shows its thoughts. Built on [---] Flashs speed and performance this model is trained to use thoughts to strengthen its reasoning. And we see promising results"
X Link 2024-12-19T17:16Z 64.5K followers, 328K engagements

"Gemini-2.0-Flash-Thinking #1 across all categories"
X Link 2024-12-19T17:16Z 61.2K followers, 183.8K engagements

"Notable improvement in Math Arena"
X Link 2024-12-19T17:16Z 60.3K followers, [----] engagements

"Gemini-2.0-Flash-Thinking is also #1 in Vision Arena"
X Link 2024-12-19T17:16Z 60.3K followers, [----] engagements

"Overall win-rate heatmap"
X Link 2024-12-19T17:16Z 60.3K followers, [----] engagements

"WebDev Arena update: o1-mini climbs to #2 @AnthropicAI's Claude [---] Sonnet remains #1 with a 100+ points lead. New entries: - #3: gemini-2.0-flash-thinking - #9: Llama-3.1-405b New leaderboard visualizations are live. See more in the thread👇"
X Link 2024-12-23T16:43Z 60.8K followers, 28.5K engagements

"Win-rate heatmap"
X Link 2024-12-23T16:43Z 60.8K followers, [---] engagements

"o1-mini vs Claude [---] Sonnet "Conway's Game of Life""
X Link 2024-12-23T16:43Z 60.7K followers, [----] engagements

"📣 @CopilotArena leaderboard is now live on We have collected over 15k votes on [--] models (2 new models since our last blogpost release). Congrats @deepseek_ai🥇and @AnthropicAI🥇 http://lmarena.ai/leaderboard Which model is best for coding @CopilotArena leaderboard is out Our code completions leaderboard contains data collected over the last month with 100K completions served and 10K votes Lets discuss our findings so far🧵 https://t.co/gBJ8qXiTIy http://lmarena.ai/leaderboard Which model is best for coding @CopilotArena leaderboard is out Our code completions leaderboard contains data"
X Link 2024-12-23T19:04Z 60.7K followers, 30.9K engagements

"what would you like lmarena to build/fix in [----] :)"
X Link 2024-12-25T19:53Z 60.9K followers, 28.3K engagements

"@jonkkillian o1 is in arena :) come vote"
X Link 2024-12-25T20:05Z 60.8K followers, [---] engagements

"DeepSeek V3 is now live in the Arena🔥 Congrats @deepseek_ai on the impressive release matching top proprietary models like Claude Sonnet/GPT-4o across standard benchmarks. Now it's time for the human test - come challenge it with your toughest prompts at lmarena and stay tuned for leaderboard release 🚀 Introducing DeepSeek-V3 Biggest leap forward yet: ⚡ [--] tokens/second (3x faster than V2) 💪 Enhanced capabilities 🛠 API compatibility intact 🌍 Fully open-source models & papers 🐋 1/n https://t.co/p1dV9gJ2Sd 🚀 Introducing DeepSeek-V3 Biggest leap forward yet: ⚡ [--] tokens/second (3x faster"
X Link 2024-12-26T18:27Z 61.1K followers, 51.9K engagements

"you can just measure things"
X Link 2024-12-29T21:08Z 60.7K followers, 11.7K engagements

"Exciting News from Chatbot Arena❤🔥 @OpenAI's o1 rises to joint #1 (+24 points from o1-preview) and @deepseek_ai DeepSeek-V3 secures #7 now the best and the only open model in the top-10 o1 Highlights: - Achieves the highest score under style control - #1 across all domains except creative writing DeepSeek-V3: - The most cost-effective model in the top-10 ($0.14/1M input token) - Robust performance under style control - Strong in hard prompts and coding Congrats both to the incredible achievements Looking forward to seeing more breakthroughs in [----]. More analysis in the thread👇"
X Link 2024-12-30T11:39Z 65.9K followers, 291.2K engagements

"o1 #1 across all domains except creative writing; DeepSeek strong in hard prompts and coding"
X Link 2024-12-30T11:39Z 61.1K followers, 163.1K engagements

"o1 #1 in Hard Prompt under Style Control"
X Link 2024-12-30T11:39Z 60.9K followers, [----] engagements

"Confidence interval of model strengths under style control: o1 gets the highest score. DeepSeek-v3 right after Claude-3.5-Sonnet"
X Link 2024-12-30T11:39Z 60.9K followers, [----] engagements

"o1 #1 in Math Arena"
X Link 2024-12-30T11:39Z 60.9K followers, [----] engagements

"Check out full data at and let us know your thoughts http://lmarena.ai/leaderboard http://lmarena.ai/leaderboard"
X Link 2024-12-30T11:39Z 60.9K followers, [----] engagements

"@MRmor44 @OpenAI @deepseek_ai check out the hard prompts ranking :) https://x.com/lmarena_ai/status/1873695391524536700 o1 #1 in Hard Prompt under Style Control. https://t.co/VgVMyzdn1N https://x.com/lmarena_ai/status/1873695391524536700 o1 #1 in Hard Prompt under Style Control. https://t.co/VgVMyzdn1N"
X Link 2024-12-30T12:00Z 60.7K followers, [----] engagements

"@testingcatalog @OpenAI @deepseek_ai yes great idea"
X Link 2024-12-30T12:02Z 60.7K followers, [----] engagements

"@VictorTaelin @OpenAI @deepseek_ai Yes Sonnet is #1 in Hard Prompt/Coding under style control We use categorization and techniques to decouple style from substance on user prompts. https://x.com/lmarena_ai/status/1829216988021043645 Does style matter over substance in Arena Can models "game" human preference through lengthy and well-formatted responses Today we're launching style control in our regression model for Chatbot Arena our first step in separating the impact of style from substance in https://t.co/EYF762uAGF https://x.com/lmarena_ai/status/1829216988021043645 Does style matter over"
X Link 2024-12-30T18:12Z 60.7K followers, [----] engagements

"@VictorTaelin @OpenAI @deepseek_ai Totally that's why we develop a data curation pipeline (Hard Prompts) to filter out low quality data More technical details in blog: https://blog.lmarena.ai/blog/2024/hard-prompts/ https://blog.lmarena.ai/blog/2024/hard-prompts/"
X Link 2024-12-30T18:20Z 60.7K followers, [---] engagements

"@VictorTaelin @OpenAI @deepseek_ai We're constantly surprised by how people use lmarena to test models in depth. We'll soon release a data visualization tool for community to understand more about the voter/prompt distribution"
X Link 2024-12-30T18:23Z 60.7K followers, [---] engagements

"@VictorTaelin @OpenAI @deepseek_ai random internet users likely not capable of asking "hard prompts" but only power users like you :) yes we also tried sonnet for criteria labelling and got similar results but for sure we are working on upgrading from llama-3. thanks for the feedback"
X Link 2024-12-30T18:33Z 60.7K followers, [---] engagements

"🚨WebDev Arena [----] Wrapped Three exciting new entries: - #2: Claude-3.5-Haiku - #4: o1-2024-12-17 - #7: DeepSeek-V3 Claude [---] Sonnet retains its #1 position with an 80+ point lead while [---] Haiku jumps to #2 Congrats again @AnthropicAI🎉 Check out more stats below👇"
X Link 2024-12-30T19:32Z 61K followers, 49.3K engagements

"🎨Text-to-Image Arena Leaderboard is now live with 40K+ community votes Top Models: - #1. Recraft V3 - #2. Ideogram [---] - #3. FLUX1.1 pro - #3. Luma Photon - #5. DALLE [--] - #5. FLUX.1 dev - #7. Stable Diffusion [---] Large Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots Which model youd like to see next Comment to let us know. More analysis and leaderboard link below👇 Image Generation has landed🎨🤖 Introducing Image Arena - a battle of image generation models like FLUX Stable Diffusion Dall-E Recraft Ideogram and more Who will reign supreme [--]. Describe your"
X Link 2025-01-06T17:23Z 62.1K followers, 66.7K engagements

"The Overall leaderboard combines both User Prompts and Pre-generated Prompts which we provide breakdown in category. In User Prompts Only breakdown (74% of total votes) Ideogram [---] jumps to #1"
X Link 2025-01-06T17:23Z 60.9K followers, [----] engagements

"Interestingly model rankings shift in the Pre-generated Prompts (26% of total votes). FLUX/Stable Diffusion improves significantly and DALLE/Ideogram drop. Our preset prompts are taken from: https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1 https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1"
X Link 2025-01-06T17:23Z 60.9K followers, [----] engagements

"We're excited to announce that we'll be collaborating with @togethercompute to make the best LLM coding evals together We'll improve WebDev Arena by adding new modality and complex coding agent that better assess LLM's the real world coding capabilities. Also we thank Together AI for kindly sponsoring usage for open models on WebDev Arena"
X Link 2025-01-08T17:13Z 61.1K followers, 20.1K engagements

"Curious about how people use Chatbot Arena Introducing Arena Explorer🔎: a brand-new way to visualize data from Chatbot Arena With topic modeling pipeline we organize user prompts into: - Big categories like Programming Creative Writing and Tech Concepts - Sub-categories like Python/React or Lyrics Writing Check out the demo below👇"
X Link 2025-01-15T17:29Z 63.8K followers, 12.1K engagements

"You can hover to see different categories and click to show specific examples"
X Link 2025-01-15T17:29Z 61.2K followers, [---] engagements

"Most popular prompt categories asked by Arena users: - Programming Skills - Technical Concepts - Creative Writing"
X Link 2025-01-15T17:29Z 61.2K followers, [----] engagements

"Check out DeepSeek-R1 on LiveCodeBench DeepSeek-R1 (Preview) Results 🔥 We worked with the @deepseek_ai team to evaluate R1 Preview models on LiveCodeBench. The model performs in the vicinity of o1-Medium providing SOTA reasoning performance Huge kudos to the team and I'm looking forward to the full release /1 https://t.co/tfyK8ni5dh DeepSeek-R1 (Preview) Results 🔥 We worked with the @deepseek_ai team to evaluate R1 Preview models on LiveCodeBench. The model performs in the vicinity of o1-Medium providing SOTA reasoning performance Huge kudos to the team and I'm looking forward to the full"
X Link 2025-01-17T18:31Z 62K followers, 15.4K engagements

"DeepSeek-R1 is now in the Arena🔥 Congrats @deepseek_ai on R1 release An open reasoning model matching OpenAI o1 in hard benchmarks like GPQA/SWE-Bench/AIME Now for the real-world challengeR1 is in for human evaluation. Bring your toughest prompts and challenge it http://lmarena.ai http://lmarena.ai http://lmarena.ai http://lmarena.ai"
X Link 2025-01-20T18:40Z 64K followers, 22.9K engagements

"Breaking news from Text-to-Image Arena 🖼✨ @GoogleDeepMinds Imagen [--] debuts at #1 surpassing Recraft-v3 with a remarkable +70-point lead Congrats to the Google Imagen team for setting a new bar Try the best text2image at LMArena and cast your vote More analysis👇"
X Link 2025-01-22T20:31Z 76.4K followers, 439.1K engagements

"Thanks @KuittinenPetri We also observe that with more detailed input descriptions Flux models performs better. You may check the "pre-generated prompts" category where FLUX DALLE. We also find that FLUX may not perform the best with non-English prompts. We are studying more and please let us know if you have more feedback"
X Link 2025-01-22T21:04Z 62.6K followers, [---] engagements

"In Hard Prompt with Style Control DeepSeek-R1 ranked joint #1 with o1"
X Link 2025-01-24T11:19Z 64.3K followers, 20K engagements

"Early results show DeepSeek-R1 strong across all domains More votes are being collected for stable rankings"
X Link 2025-01-24T11:19Z 64K followers, 14.5K engagements

Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing