@ggerganov (Georgi Gerganov) posts on X mostly about llama.cpp, inference, and AI. They currently have [------] followers and [---] posts still getting attention, totaling [---] engagements in the last [--] hours.
Social category influence: technology brands, stocks, finance, products
Social topic influence: llamacpp (#34), inference, ai, the new, llama, this is, cpu, in the, pro, project
Top accounts mentioned or mentioned by: @huggingface @ngxson @nvidiaaipc @apple @ollama @abetlen @yavorgi @evmills @teknium1 @starfoom @philpax @reachvb @simonw @unslothai @erusev @antoniostoilkov @flipperzero @rgerganov @fdsphilippe @lofi911
Top posts by engagements in the last [--] hours
"Near real-time audio transcription from the microphone using a custom C/C++ inference of #OpenAI Whisper ASR model. Runs on MacBook M1 Pro (CPU only) https://github.com/ggerganov/whisper.cpp https://github.com/ggerganov/whisper.cpp"
X Link 2022-10-02T16:35Z 43.5K followers, [--] engagements
"My [--] cents on the #ChatGPT is often wrong criticism:"
X Link 2022-12-03T13:11Z 40.7K followers, [--] engagements
"@abetlen I haven't fully caught up with the "guidance" idea: - Do I understand correctly that the "guidance" adjusts / constrains the logits for the next token based on the provided program - Your notebook seems to be invoking "text-davinci-003" but it seems to run locally - how come"
X Link 2023-05-19T07:57Z 40K followers, [----] engagements
"RFID Emulation is also available - perfect application for conference / social events (@flipper_zero reader demo)"
X Link 2023-07-12T19:09Z 35.2K followers, [----] engagements
"ggtag is fully open-source and open-hardware Hope you have as much fun using ggtag as we did while making it Huge shoutout to @rgerganov for pulling the entire project together. Make sure to checkout the repo for more details:"
X Link 2023-07-12T19:09Z 34.2K followers, [----] engagements
"Basic Vim plugin for llama.cpp Far from usable but promising initial steps https://github.com/ggerganov/llama.cpp/pull/2495 https://github.com/ggerganov/llama.cpp/pull/2495"
X Link 2023-08-08T11:43Z 52.1K followers, 24.1K engagements
"Similar test with M2 Ultra using vanilla 70B LLaMA Prompt processing is currently unoptimized but just the text generation yields [--] tok/s. This is Q4_0 quantization (39GB) using Metal"
X Link 2023-08-08T16:01Z 35.2K followers, 53.1K engagements
"Here are a couple of more examples of speculative sampling"
X Link 2023-08-31T14:59Z 35.2K followers, 19K engagements
"@FdsPhilippe @LOFI911 @abetlen Yeah you likely cannot run this setup on anything else other than Ultra. However you can do the same trick with F16 13B + Q4 7B and still get some boost (1.5x) on a 64GB MacBook (I think)"
X Link 2023-08-31T16:17Z 40.7K followers, [---] engagements
"@thelokasiffers Yes you need both models in memory. Quantization helps to reduce the memory footprint of the draft model and also makes it faster"
X Link 2023-08-31T20:02Z 35.9K followers, [--] engagements
"This is using an F16 30B LLaMA v1 target model + a 7B quantum draft model Let me know if you have ideas for some interesting applications - summarization extraction etc. I'm curious if this approach is viable"
X Link 2023-09-03T15:38Z 35.9K followers, [----] engagements
"@KennethCassel @YavorGI Speed is not there yet but lots of room for improvement. Should by OK an Apple Silicon I think"
X Link 2023-09-05T16:29Z 34.2K followers, [----] engagements
"π€¦ Forgot to push the latest version so in case you were already testing and wondering why there is still CPU usage - git pull"
X Link 2023-09-12T21:13Z 34.2K followers, [----] engagements
"After a few more optimizations Metal is now faster than Core ML across the board Most of the speed-up here actually came from the work in llama.cpp. Nice to see everything coming together so nicely π"
X Link 2023-09-13T17:59Z 34.2K followers, [----] engagements
"This is probably a long shot but maybe some Core ML / Metal experts can give a hint I'm comparing a Core ML (GPU) vs custom Metal (GPU) implementation of the Whisper encoder and on the first run they have similar performance but on next runs Core ML takes x2 advantage"
X Link 2023-09-13T20:21Z 32.3K followers, 34.7K engagements
"Have a few thoughts about this approach But most importantly I'm happy to see @EvMill's idea on softmax1 recognized - to my very basic and intuitive understanding of LLMs it made enough sense to warrant further analysis"
X Link 2023-10-02T17:53Z 35.2K followers, 13.5K engagements
"Here are some Apple Silicon stats for llama.cpp You can use these numbers to estimate the performance that you would get on your Mac for typical bs=1 cases"
X Link 2023-11-24T15:39Z 34.2K followers, 69.2K engagements
"@Teknium1 The grammar is currently evaluated on the CPU in a single thread - certainly room for optimizations. There a couple of PRs with improvements already pending I'm bullish on llama.cpp grammar"
X Link 2023-11-25T16:48Z 40K followers, [----] engagements
"@Starfoom Not sure what is lambda but most likely it can π"
X Link 2023-11-27T20:15Z 34.1K followers, [---] engagements
"@sidneyvanness @Starfoom Apologies for not getting back I have almost [--] experience with cloud infra (this was my first AWS instance ever) Maybe the community can help - try opening a discussion on Github"
X Link 2023-11-27T20:24Z 34.1K followers, [---] engagements
"@tairov If I'm reading correctly their pricing 13B LLaMA provisioned for [--] month is $15k You can run quantum 13B on the g4dn.xlarge instance above for $500/month"
X Link 2023-11-27T21:07Z 34.1K followers, [----] engagements
"Here is a quick demo with the 4-bit base model on M2 Ultra You can run this on your MacBooks - the quantized models are between 15GB 50GB. The prompt processing speed is subpar atm but we will improve this"
X Link 2023-12-11T14:46Z 35.2K followers, 35.6K engagements
"For people asking - llama.cpp has been already deployable on iOS for quite some time now Check the llama.swiftui example in the llama.cpp repo for a starting point: https://github.com/ggerganov/llama.cpp/pull/4483 https://github.com/ggerganov/llama.cpp/pull/4483"
X Link 2023-12-15T16:29Z 40.9K followers, 12K engagements
"@jordibruin @om How much faster is it according to your tests"
X Link 2023-12-16T11:02Z 35.1K followers, [---] engagements
"@vagero4 @INSAITinstitute I know some vague details - think the team will present more info next month. Signed up for early access"
X Link 2024-01-15T15:52Z 35.9K followers, [---] engagements
"@Teknium1 @philpax_ Planning to add models such as RWKV T5. Already have Mamba support. The focus is mostly LLMs. Stable diffusion is unlikely to be supported in llama.cpp. I'm not familiar with classifiers yet"
X Link 2024-03-16T11:25Z 36.9K followers, [----] engagements
"llama.cpp is currently sitting at [--] t/s"
X Link 2024-04-04T16:25Z 37.8K followers, 17.3K engagements
"A community space running llama.cpp on HF infra Very cool https://huggingface.co/spaces/gokaygokay/Gemma-2-llamacpp https://huggingface.co/spaces/gokaygokay/Gemma-2-llamacpp"
X Link 2024-07-05T10:24Z 51.3K followers, 25.4K engagements
"Highlighting project paddler - a load-balancer for llama.cpp servers by mcharytoniuk@gh https://github.com/distantmagic/paddler https://github.com/distantmagic/paddler"
X Link 2024-07-10T15:02Z 39.8K followers, [----] engagements
"Alan Gray (NVIDIA) writes about implementing CUDA graphs in llama.cpp: https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-inference-with-cuda-graphs/ https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-inference-with-cuda-graphs/"
X Link 2024-08-08T10:37Z 40.7K followers, 14.9K engagements
"No iPhone Mirroring in the region. Apparently not available in EU"
X Link 2024-09-18T10:01Z 40.6K followers, 12K engagements
"@reach_vb @bartowski1182 @MaziyarPanahi I really dig their quant queue status page"
X Link 2024-09-20T15:10Z 40.6K followers, [----] engagements
"Let's bring llama.cpp to the clouds You can now run llama.cpp-powered inference endpoints through Hugging Face with just a few clicks. Simply select a GGUF model pick your cloud provider (AWS Azure GCP) a suitable node GPU/CPU and you are good to go. For more info check the short video below by @ngxson Massive thanks to @huggingface for making this possible and for being supper supportive of the llama.cpp project Looking forward to improving this functionality in the future by adding more features supporting more hardware and improving the overall performance. Any feedback is highly"
X Link 2024-09-27T15:56Z 40.6K followers, 46.4K engagements
"Technical blog by @nvidia about using llama.cpp on RTX-powered systems: "More than [--] tools and apps are now accelerated with llama.cpp on the RTX platform" Nice If you are one of these tools give us a shoutout :) https://developer.nvidia.com/blog/accelerating-llms-with-llama-cpp-on-nvidia-rtx-systems/ https://developer.nvidia.com/blog/accelerating-llms-with-llama-cpp-on-nvidia-rtx-systems/"
X Link 2024-10-02T13:35Z 40.7K followers, 23.4K engagements
"The PocketPal AI mobile app (iOS and Android) was open sourced yesterday. The project looks fun and seems like a useful playground for LLMs at the edge. Looking forward to where the community takes it from here. Shoutout to the author @ghorbani_asghar https://github.com/a-ghorbani/pocketpal-ai https://github.com/a-ghorbani/pocketpal-ai"
X Link 2024-10-22T07:05Z 40.6K followers, 15.9K engagements
"The HFChat macOS app was just open sourced. llama.cpp inside for local inference through Swift bindings (i.e. no intermediate REST server and API). Will be playing with this tomorrow The source code for HFChat macOSπ€is now fully open source and accepting PRs Looking forward to see what folks will build. You'll also find some hidden features that never made it to the release: https://t.co/804uBWU1Nm https://t.co/8gvVvyB8DY The source code for HFChat macOSπ€is now fully open source and accepting PRs Looking forward to see what folks will build. You'll also find some hidden features that never"
X Link 2024-10-23T21:24Z 52.1K followers, 34.1K engagements
"HuggingFace Inference Endpoints now supports deploying llama.cpp-powered instances on CPU servers too This is a first step towards a wider low-cost cloud LLM availability especially with new AI-friendly instruction sets on the rise. More info: https://x.com/ngxson/status/1861726639287050505 Hugging Face inference endpoints now support CPU deployment for llama.cpp π π Why this is a huge deal Llama.cpp is well-known for running very well on CPU. If you're running small models like Llama 1B or embedding models this will definitely save tons of money π° π° https://t.co/YlVbwUCyQH"
X Link 2024-11-27T16:45Z 40.9K followers, 19.4K engagements
"@simonw Yes indeed. Personally I'm sticking with Qwen 32B Q8 for now. Btw if you are not using speculative decoding you are missing out"
X Link 2024-12-09T16:12Z 41K followers, 19.5K engagements
"This is the command I'm using: llama-server -m ./models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf -md ./models/qwen2.5-1.5b-coder-instruct/ggml-model-q4_0.gguf --log-file ./service-chat.log --port [----] --ctx-size [--] --cache-reuse [---] -ub [----] -b [----] -ngl [--] -ngld [--] -fa -dt [---] -lv [--] -t [--] --draft-max [--] --draft-min [--] End then make sure to set Top-K = [--] in the UI settings (very important). Let me know what you think"
X Link 2024-12-09T16:35Z 41K followers, [----] engagements
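For context, speculative decoding needs only a target model, a small draft model, and the draft limits - the rest of the flags above are tuning. A minimal sketch (hypothetical model paths; the values here are illustrative, not the redacted ones above):

llama-server -m ./models/target-q8_0.gguf -md ./models/draft-q4_0.gguf --draft-min 4 --draft-max 16 -ngl 99 -ngld 99 -fa

The draft model cheaply proposes a few tokens and the target model verifies them in one batch, so the output matches what the target model alone would produce - just faster whenever the draft guesses well.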
"The cool thing about these plugins is that the entire logic is extremely simple. The plugin essentially maintains a ring buffer of chunks from your code and sends [--] types of requests to the server. Here is the full vimscript implementation - probably the AGI can one-shot translate this for other IDEs too π€ https://github.com/ggml-org/llama.vim/blob/master/autoload/llama.vim llama.vscode (powered by Qwen Coder) https://t.co/kt48aadRVc https://github.com/ggml-org/llama.vim/blob/master/autoload/llama.vim llama.vscode (powered by Qwen Coder) https://t.co/kt48aadRVc"
X Link 2025-01-23T17:57Z 43.4K followers, 22K engagements
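For context, llama.vim talks to a running llama-server over HTTP, sending fill-in-the-middle requests built from that ring buffer. A hedged sketch of such a request against the server's /infill endpoint (the port and field values here are assumptions for illustration):

curl -s http://127.0.0.1:8012/infill -H "Content-Type: application/json" -d '{"input_prefix": "def add(a, b):\n    return ", "input_suffix": "\n\nprint(add(1, 2))\n", "n_predict": 32}'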
"And while I have your attention help us find a workaround for this weird macOS behaviour with the GPU / Metal. It would greatly improve the local LLM experience on @apple devices: https://github.com/ggerganov/llama.cpp/pull/10119 https://github.com/ggerganov/llama.cpp/pull/10119"
X Link 2025-01-24T15:01Z 43.5K followers, 21.4K engagements
"PSA: make sure to update your brew llama.cpp packages to enjoy major performance improvement for your llama.vim and llama.vscode FIMs brew install llama.cpp https://github.com/ggerganov/llama.cpp/pull/11427 https://github.com/ggerganov/llama.cpp/pull/11427 https://github.com/ggerganov/llama.cpp/pull/11427 https://github.com/ggerganov/llama.cpp/pull/11427"
X Link 2025-01-29T16:14Z 43.8K followers, 23.2K engagements
"llama.cpp RTX [--] benchmarks on the Distilled DeepSeek models Accelerate DeepSeek reasoning models with RTX AI PCs. Learn more here π https://t.co/UVVICyUIWn https://t.co/JIFlJu8l70 Accelerate DeepSeek reasoning models with RTX AI PCs. Learn more here π https://t.co/UVVICyUIWn https://t.co/JIFlJu8l70"
X Link 2025-01-31T20:33Z 43.6K followers, [----] engagements
"Best way to start the week - install llama.vscode More than [----] installs so far https://marketplace.visualstudio.com/itemsitemName=ggml-org.llama-vscode https://marketplace.visualstudio.com/itemsitemName=ggml-org.llama-vscode"
X Link 2025-02-03T07:17Z 51.3K followers, 17.5K engagements
"The llama.cpp backend has been officially merged in TGI https://huggingface.co/docs/text-generation-inference/en/backends/llamacpp https://huggingface.co/docs/text-generation-inference/en/backends/llamacpp"
X Link 2025-02-14T16:31Z 52.1K followers, 64.8K engagements
"whisper.cpp v1.7.5 is out"
X Link 2025-04-02T14:41Z 49.9K followers, 20.8K engagements
"π¦ My 2-day work: Llama [--] on llama.cpp - on the horizon I had more fun doing this than I initially expected :laughing: What I learnt while working on this Follow in :thread: https://t.co/8BLpFFVIXv My 2-day work: Llama [--] on llama.cpp - on the horizon I had more fun doing this than I initially expected :laughing: What I learnt while working on this Follow in :thread: https://t.co/8BLpFFVIXv"
X Link 2025-04-07T20:47Z 50K followers, [----] engagements
"PSA for applications that use local AI models - here is how to do it right: More and more applications are adding support for local AI models which is great. But I notice that they are doing it the wrong way (see the screenshots below). The right way to do it is to add a provider-agnostic option. For example: "Custom endpoint". And then all you have to do is let the user enter a custom URL (of course have one entered by default). That's all. Everything extra is making your job harder for no reason. The idea is to let the model management be performed by a specialized 3rd party app which would"
X Link 2025-05-22T16:20Z 50.4K followers, 25.8K engagements
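The custom-endpoint approach works because servers like llama-server expose an OpenAI-compatible API, so an app only needs a base URL. A minimal sketch, assuming a server on its default port:

curl -s http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'

Any provider-agnostic client can point at this URL without knowing or caring which backend serves it.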
"Congrats on the exciting release The new architecture changes to the model are definitely interesting and will be fun to play with. Admirations for the community initiative to build on-device products. Looking forward to many people joining the challenge Im so excited to announce Gemma 3n is here π πMultimodal (text/audio/image/video) understanding π€―Runs with as little as 2GB of RAM πFirst model under 10B with @lmarena_ai score of 1300+ Available now on @huggingface @kaggle llama.cpp https://t.co/CNDy479EEv and more https://t.co/Xap0ymhCmr Im so excited to announce Gemma 3n is here π"
X Link 2025-06-26T17:12Z 50.6K followers, 15.3K engagements
"@simonw Did some experiments recently using Qwen3-30B-A3B (Q8_0) routed to Claude Code - I think it was a pretty good"
X Link 2025-07-24T10:10Z 50.8K followers, [----] engagements
"The work on supporting gpt-oss was done in collaboration with teams at @OpenAI @HuggingFace and @NVIDIA_AI_PC. At @ggml_org we truly appreciate the support and trust of these partners and we look forward to more future collaborations as we continue to bring AI to the edge"
X Link 2025-08-05T17:12Z 51.3K followers, [----] engagements
"OpenAIs New Open Models Accelerated Locally on NVIDIA GeForce RTX and RTX PRO GPUs https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss/ https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss/"
X Link 2025-08-05T17:12Z 51.3K followers, [----] engagements
"@ollama @ivanfioravanti I am sure you worked hard and did your best. But this ollama TG graph makes no sense - speed cannot increase at larger context. Do you by any chance limit the context to 8k tokens Why is 16k total processing time less than 8k Please - this is not serious. Do not share this"
X Link 2025-08-06T16:53Z 51.7K followers, 23.2K engagements
"Nah you have it all wrong. ollama is "moving away from llama.cpp" - they only use it for "legacy models and CPU compatibility". The fact that this PR includes 2x faster GPU kernels and more than [--] model architectures that are not supported by ollama's new engine is completely accidental /s"
X Link 2025-08-10T19:31Z 51.7K followers, [---] engagements
"Feel free to chime in: https://github.com/ggml-org/llama.cpp/discussions/15313 https://github.com/ggml-org/llama.cpp/discussions/15313"
X Link 2025-08-14T13:59Z 51.8K followers, 11.4K engagements
"Firefox adds LLM capabilities for add-ons via llama.cpp https://www.firefox.com/en-US/firefox/142.0/releasenotes/ https://www.firefox.com/en-US/firefox/142.0/releasenotes/"
X Link 2025-08-19T16:38Z 52K followers, 23K engagements
"The llama-server is compatible with most of your favorite coding agents (Claude Code crush Cline etc.) See the "Tips" section at the end of the guide for more information"
X Link 2025-08-19T19:20Z 51.8K followers, [----] engagements
"llama.qtcreator is now part of ggml-org"
X Link 2025-08-20T15:04Z 51.8K followers, [----] engagements
"The project is created and maintained by Cristian Adam. Here is a blog post about how he created the plugin: https://cristianadam.eu/20250817/from-llama-dot-vim-to-qt-creator-using-ai/ https://cristianadam.eu/20250817/from-llama-dot-vim-to-qt-creator-using-ai/"
X Link 2025-08-20T15:04Z 51.8K followers, [----] engagements
"I just ran the gpt-oss eval suite with the large gpt-oss-120b on my M2 Ultra using vanilla llama.cpp and got the following scores: - GPQA: 79.8% - AIME25: 96.6% These numbers are inline with those from various cloud providers: Here are the steps: https://github.com/ggml-org/llama.cpp/discussions/15396#discussioncomment-14167264 https://github.com/ggml-org/llama.cpp/discussions/15396#discussioncomment-14167264 TIL you can run an eval suite against OpenAI's gpt-oss-20b open weights model running in LM Studio with the following uv one-liner: https://t.co/96rGqLrqAe"
X Link 2025-08-20T18:43Z 52.1K followers, 39.3K engagements
"Here are some M3 Ultra performance numbers for gpt-oss This weekend: it's gpt-oss time So I thought: let's test latest llama.cpp build [----] that is super fast on MoE and results are. π₯ππ₯ Look at 20b and 120b here https://t.co/kb0hNI1x3B This weekend: it's gpt-oss time So I thought: let's test latest llama.cpp build [----] that is super fast on MoE and results are. π₯ππ₯ Look at 20b and 120b here https://t.co/kb0hNI1x3B"
X Link 2025-08-29T03:54Z 52.1K followers, 10.7K engagements
"For llama.vim the recommended setup now is Qwen [--] Coder 30B A3B Instruct: brew install llama.cpp llama-server --fim-qwen-30b-default Amazingly on Macs the 30B MoE model performs better than the old Qwen [---] Coder 7B so if you have the necessary RAM it's better to switch to Qwen [--] https://github.com/ggml-org/llama.vim https://github.com/ggml-org/llama.vim"
X Link 2025-08-29T16:49Z 52.2K followers, 32.6K engagements
"More info in this PR https://github.com/microsoft/vscode/issues/249605 https://github.com/microsoft/vscode/issues/249605"
X Link 2025-09-03T15:01Z 52.1K followers, [----] engagements
"HuggingFace just shipped in-browser GGUF editing It allows you to edit GGUF metadata in the comfort of your browser without having to even download the full model. This feature is enabled via the Xet technology that makes partial file updates possible"
X Link 2025-10-07T14:45Z 52.2K followers, 36.4K engagements
"More info about Xet: https://huggingface.co/blog/xet-on-the-hub https://huggingface.co/blog/xet-on-the-hub"
X Link 2025-10-07T14:45Z 52.2K followers, [----] engagements
"Setting up NVIDIA DGX Spark with ggml I got to play for a few days with this new device and wrote a short guide about how to configure it for various local AI use cases. https://github.com/ggml-org/llama.cpp/discussions/16514 https://github.com/ggml-org/llama.cpp/discussions/16514"
X Link 2025-10-14T04:27Z 52.3K followers, 14.3K engagements
"NVIDIA engineer just contributed an update that improves the generation performance of llama.cpp by up to 40% on DGX Spark. All numbers in the post above have been updated https://github.com/ggml-org/llama.cpp/pull/16585 https://github.com/ggml-org/llama.cpp/pull/16585"
X Link 2025-10-15T16:00Z 52.2K followers, [----] engagements
"This is the prompt for anyone interested: https://github.com/ggerganov/whisper.cpp/blob/c353100bad5b8420332d2d451323214c84c66950/examples/talk.llama/talk-llama.cpp#L177-L194 https://github.com/ggerganov/whisper.cpp/blob/c353100bad5b8420332d2d451323214c84c66950/examples/talk.llama/talk-llama.cpp#L177-L194"
X Link 2023-03-26T16:31Z 53.3K followers, 126.1K engagements
"Casually running a 180B parameter LLM on M2 Ultra"
X Link 2023-09-07T14:26Z 53.3K followers, 688.2K engagements
"Some of llama.cpp's features https://github.com/ggerganov/llama.cpp/discussions/3471 https://github.com/ggerganov/llama.cpp/discussions/3471"
X Link 2023-10-18T20:57Z 53.3K followers, 57.9K engagements
"llama.cpp server now support multimodal (LLaVA) π Huge shoutout to FSSRepo and monatis https://www.reddit.com/r/LocalLLaMA/comments/17e855d/llamacpp_server_now_supports_multimodal/ https://www.reddit.com/r/LocalLLaMA/comments/17e855d/llamacpp_server_now_supports_multimodal/"
X Link 2023-10-23T07:44Z 53.3K followers, 119.3K engagements
"llama.cpp releases now ship with pre-built macOS binaries This should reduce the entry barrier for llama.cpp on Apple devices Thanks to @huggingface for the friendly support π"
X Link 2024-03-21T20:01Z 53.3K followers, 46.5K engagements
"whisper.cpp is coming to ffmpeg https://github.com/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b https://github.com/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b"
X Link 2025-08-11T19:30Z 53.3K followers, 179.8K engagements
"Apparently Stable Diffusion can be used to generate images of spectrograms from text prompts. The spectrograms can in turn be converted to audio using STFT and some tricks Mind is blown https://www.riffusion.com/about https://www.riffusion.com/about"
X Link 2022-12-15T15:52Z 53.3K followers, [---] engagements
"4-bit integer quantisation in whisper.cpp / ggml You can now run the Large Whisper model locally in a web page via WebAssembly SIMD https://whisper.ggerganov.com https://whisper.ggerganov.com"
X Link 2023-02-27T16:49Z 53.3K followers, 44K engagements
"I think I can make 4-bit LLaMA-65B inference run on a [--] GB M1 Pro π€ Speed should be somewhere around [--] tokens/sec. Is this useful for anything"
X Link 2023-03-05T16:47Z 53.3K followers, 293.2K engagements
"Progress update on adding Core ML support to whisper.cpp We can now run the small model with a 400ms time step quite efficiently thanks to evaluating the Encoder on the ANE"
X Link 2023-03-07T17:22Z 53.3K followers, 62K engagements
"Here is 4-bit inference of LLaMA-7B using ggml: Pure C/C++ runs on the CPU at [--] tokens/sec (M1 Pro) Generated text looks coherent but quickly degrades - not sure if I have a bug or something π€ Anyway LLaMA-65B on M1 coming soon https://github.com/ggerganov/llama.cpp https://github.com/ggerganov/llama.cpp"
X Link 2023-03-10T19:58Z 53.3K followers, 117K engagements
"Simultaneously running LLaMA-7B (left) + Whisper Small (right) on M1 Pro"
X Link 2023-03-10T22:30Z 53.3K followers, 285.6K engagements
"Just added support for all LLaMA models I'm out of disk space so if someone can give this a try for 33B and 65BB would be great π See updated instructions in the Readme Here is LLaMA-13B at [--] tokens/s I think I can make 4-bit LLaMA-65B inference run on a [--] GB M1 Pro π€ Speed should be somewhere around [--] tokens/sec. Is this useful for anything I think I can make 4-bit LLaMA-65B inference run on a [--] GB M1 Pro π€ Speed should be somewhere around [--] tokens/sec. Is this useful for anything"
X Link 2023-03-11T09:37Z 53.3K followers, 215.7K engagements
"Interactive chat mode added to π¦.cpp It actually works surprisingly well from the few tests that I tried Kindly contributed by GH user Blackhole89"
X Link 2023-03-12T21:24Z 53.3K followers, 80.8K engagements
"Here is what a properly built llama.cpp looks like Running 7B on [--] years old Pixel [--] at [--] token/sec. Would be interesting to see how an interactive session feels like Running llama.cpp on my Pixel5 phone with termux. Kudos to @ggerganov https://t.co/Hfs7GSaChC Running llama.cpp on my Pixel5 phone with termux. Kudos to @ggerganov https://t.co/Hfs7GSaChC"
X Link 2023-03-14T11:35Z 53.3K followers, 136.1K engagements
"I'm trying to figure out what this means Any ideas"
X Link 2023-03-14T13:37Z 53.3K followers, 114.2K engagements
"The llama.cpp repo is buzzing with activity today. Here are some highlights Added Alpaca model support and usage instructions"
X Link 2023-03-19T20:25Z 53.3K followers, 190.9K engagements
"Introducing LLaMA voice chat π¦ You can run this locally on an M1 Pro"
X Link 2023-03-26T16:06Z 53.3K followers, 1.7M engagements
"LLaMA voice chat + Siri TTS This example is now truly 100% offline since we are now using the built-in Siri text-to-speech available on MacOS through the "say" command"
X Link 2023-03-27T18:11Z 53.3K followers, 415.8K engagements
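For context, "say" is the built-in macOS speech synthesizer, so any program can get offline TTS by shelling out to it. For example:

say "Open the pod bay doors"
say -v "?"    # list the available voices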
"llama.cpp just got access to the new Copilot for Pull Request technical preview by @github Just add tags like "copilot:all" / "copilot:summary" / "copilot:walkthrough" to your PR comment the magic happens πͺ"
X Link 2023-03-28T17:38Z 53.3K followers, 162.6K engagements
"I'm thinking about making an open-source local iOS voice chat app running Whisper Base + 4-bit Cerebras-GPT 2.7B. Should be able to run quite real-time on newer iPhones Pretty sure I have everything needed and can build this in a day. Only question is if Cerebras is good enough"
X Link 2023-03-28T20:56Z 53.3K followers, 116.7K engagements
"I'm color-coding Whisper tokens based on their probs -- green means confident. All models behave in a similar way (first [--] images) except for Large V2. The probs are all over the place (4th image) π€ Do I have a bug or is this model somehow unstable"
X Link 2023-03-29T14:46Z 53.3K followers, 79.4K engagements
"Announcing the Local LLaMA podcast ππ¦ In today's episode we have LLaMA GGaMA SSaMA and RRaMA joining us to discuss the future of AI"
X Link 2023-04-01T10:42Z 53.3K followers, 197.2K engagements
"Can't help but feel the AI hype is oriented in a non-optimal direction It's almost as if we had just discovered the FFT algorithm and instead of revolutionizing telecommunications we are using it to build Tamagotchis P.S. I'm only half joking π"
X Link 2023-04-11T11:37Z 53.3K followers, 98.1K engagements
"Will be cancelling my Github Copilot subscription soon π"
X Link 2023-04-11T18:44Z 53.3K followers, 114.6K engagements
"whisper.cpp v1.3.0 now with Core ML support Currently the Encoder runs on the ANE while the Decoder remains on the CPU. Check the linked PR [---] for implementation details and usage instructions https://github.com/ggerganov/whisper.cpp/releases/tag/v1.3.0 https://github.com/ggerganov/whisper.cpp/releases/tag/v1.3.0"
X Link 2023-04-15T15:11Z 53.3K followers, 175.8K engagements
"Initial low-rank adaptation support has been added to llama.cpp We now have the option to apply LoRA adapters to a base model at runtime. Lots of room for improvements and opens up possibilities for some interesting applications https://github.com/ggerganov/llama.cpp/pull/820 https://github.com/ggerganov/llama.cpp/pull/820"
X Link 2023-04-18T14:13Z 53.3K followers, 133.1K engagements
"New release: whisper.cpp v1.4 - Added 4-bit 5-bit and 8-bit integer quantization - Added partial GPU support via cuBLAS https://github.com/ggerganov/whisper.cpp/releases/tag/v1.4.0 https://github.com/ggerganov/whisper.cpp/releases/tag/v1.4.0"
X Link 2023-05-01T12:19Z 53.3K followers, 62K engagements
"Top quality post on r/LocalLLaMA today π
Btw great subreddit"
X Link 2023-05-23T21:01Z 53.3K followers, 66.2K engagements
"The plan for adding full-fledged GPU support in ggml is starting to take shape Today I finally finished the ggml computation graph export / import functionality and demonstrated a basic MNIST inference on the Apple Silicon GPU using Metal https://github.com/ggerganov/ggml/pull/108 https://github.com/ggerganov/ggml/pull/108"
X Link 2023-05-28T19:22Z 53.3K followers, 76.5K engagements
"The future of on-device inference is ggml + Apple Silicon You heard it here first Watching llama.cpp do [--] tok/s inference of the 7B model on my M2 Max with 0% CPU usage and using all [--] GPU cores. Congratulations @ggerganov This is a triumph. https://t.co/C6mn7jkMLb https://t.co/8tcnVN4wEb Watching llama.cpp do [--] tok/s inference of the 7B model on my M2 Max with 0% CPU usage and using all [--] GPU cores. Congratulations @ggerganov This is a triumph. https://t.co/C6mn7jkMLb https://t.co/8tcnVN4wEb"
X Link 2023-06-04T17:03Z 53.3K followers, 347.8K engagements
"2345 and 6-bit quantization methods are now available in llama.cpp Efficient inference implementation with ARM NEON AVX2 and CUDA - see sample numbers in the screenshots Big thanks to ikawrakow for this contribution More info: https://github.com/ggerganov/llama.cpp/pull/1684 https://github.com/ggerganov/llama.cpp/pull/1684"
X Link 2023-06-06T14:18Z 53.3K followers, 109.6K engagements
"I've started a company: From a fun side project just a few months ago ggml has now become a useful library and framework for machine learning with a great open-source community http://ggml.ai http://ggml.ai"
X Link 2023-06-06T16:31Z 53.3K followers, 959.3K engagements
"shower thought : drop the position embeddings rewrite the transformer using complex numbers encode the position information in the complex phase ref : see how MRI phase encoding works"
X Link 2023-06-23T06:13Z 53.3K followers, 162.8K engagements
"Took the time to prepare a ggml development roadmap in the form of a Github Project This sets the priorities for the short/mid term and will offer a good way for everyone to keep track of the progress that is being made across related projects"
X Link 2023-06-25T14:58Z 53.3K followers, 137.2K engagements
"whisper.cpp now supports @akashmjn's tinydiarize models These fine-tuned models offer experimental support for speaker segmentation by introducing special tokens for marking speaker changes https://github.com/ggerganov/whisper.cpp/pull/1058 https://github.com/ggerganov/whisper.cpp/pull/1058"
X Link 2023-07-04T16:47Z 53.3K followers, 69.5K engagements
"llama.cpp now supports distributed inference across multiple devices via MPI This is possible thanks to @EvMill's work. Looking for people to give this a try and attempt to run a 65B LLaMA on cluster of Raspberry Pis π https://github.com/ggerganov/llama.cpp/pull/2099 https://github.com/ggerganov/llama.cpp/pull/2099"
X Link 2023-07-10T16:17Z 53.3K followers, 125.5K engagements
"ggtag : data-over-sound is back Please checkout our latest geeky side project -- An e-paper badge that can be programmed with sound Here is how it works π"
X Link 2023-07-12T19:06Z 53.3K followers, 205.8K engagements
"Very cool experiment by @chillgates_ Distributed MPI inference using llama.cpp with [--] Raspberry Pis - each one with 8GB RAM "sees" 1/6 of the entire 65B model. Inference starts around 1:10 Follow the progress here: https://github.com/ggerganov/llama.cpp/issues/2164 Yeah. I have ChatGPT at home. Not a silly 7b model. A full-on 65B model that runs on my pi cluster watch how the model gets loaded across the cluster with mmap and does round-robin inferencing π«‘ (10 seconds/token) (sped up 16x) https://t.co/Ns2lmhVgHT https://github.com/ggerganov/llama.cpp/issues/2164 Yeah. I have ChatGPT at home."
X Link 2023-07-16T14:39Z 53.3K followers, 114.9K engagements
"llama2.c running in a web-page Compiled with Emscripten and modified the code to predict one token per render pass. The page auto-loads 50MB of model data - sorry about that π https://ggerganov.com/llama2.c My fun weekend hack: llama2.c π¦π€ https://t.co/CUoF0l07oX Lets you train a baby Llama [--] model in PyTorch then inference it with one 500-line file with no dependencies in pure C. My pretrained model (on TinyStories) samples stories in fp32 at [--] tok/s on my MacBook Air M1 CPU. https://t.co/aBvKCf1t2u https://ggerganov.com/llama2.c My fun weekend hack: llama2.c π¦π€ https://t.co/CUoF0l07oX"
X Link 2023-07-23T17:56Z 53.3K followers, 269K engagements
"guys its real"
X Link 2023-07-29T16:09Z 53.3K followers, 282.6K engagements
"Lets see what this rock can do"
X Link 2023-08-07T14:47Z 53.3K followers, 211.5K engagements
"Here are some inference numbers for Code Llama on M2 Ultra at different quantum levels using latest llama.cpp pp - prompt processing tg - text generation Code Llama 7B"
X Link 2023-08-24T18:15Z 53.3K followers, 256.3K engagements
"ROCm support in llama.cpp [--] months community effort enables AMD devices to run quantum LLMs with high efficiency. Really great to see the strong collaboration in this work https://github.com/ggerganov/llama.cpp/pull/1087 https://github.com/ggerganov/llama.cpp/pull/1087"
X Link 2023-08-25T20:25Z 53.3K followers, 62.7K engagements
"Full F16 precision 34B Code Llama at [--] t/s on M2 Ultra"
X Link 2023-08-31T14:58Z 53.3K followers, 1.2M engagements
"The ggml roadmap is progressing as expected with a lot of infrastructural development already completed We now enter the more interesting phase of the project - applying the framework to practical problems and doing cool stuff on the Edge Took the time to prepare a ggml development roadmap in the form of a Github Project This sets the priorities for the short/mid term and will offer a good way for everyone to keep track of the progress that is being made across related projects https://t.co/XBshL1Ca73 Took the time to prepare a ggml development roadmap in the form of a Github Project This"
X Link 2023-08-31T18:31Z 53.3K followers, 106.6K engagements
"Experimenting with speculative decoding + grammar sampling This is an example of summarizing a short story into a structured JSON. We again utilize speculative decoding but this time we constrain the output using a JSON grammar to achieve 95% token acceptance rate"
X Link 2023-09-03T15:38Z 53.3K followers, 99.2K engagements
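For context, llama.cpp grammars are written in GBNF and passed to the CLI via --grammar-file. A toy sketch (an illustrative two-field schema, not the one from this demo) that constrains output to a JSON object:

cat > summary.gbnf <<'EOF'
root   ::= "{" ws "\"title\":" ws string "," ws "\"summary\":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
EOF
llama-cli -m model.gguf --grammar-file summary.gbnf -p "Summarize the story as JSON:"

The sampler masks any token that would violate the grammar, which is also why a draft model constrained the same way can reach such a high acceptance rate.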
"sam.cpp π Inference of Meta's Segment Anything Model on the CPU Project by @YavorGI - powered by http://ggml.ai http://ggml.ai"
X Link 2023-09-05T16:09Z 53.3K followers, 267.5K engagements
"Full GPU Metal inference with whisper.cpp This is the Medium model on M2 Ultra greedy decoding"
X Link 2023-09-12T18:41Z 53.3K followers, 59K engagements
"Initial tests with parallel decoding in llama.cpp A simulated server processing [--] client requests with [--] decoding streams on M2 Ultra. Supports hot-plugging of new sequences. Model is 30B LLaMA F16 [----] tokens (994 prompt + [----] gen) with system prompt of [---] tokens in 46s"
X Link 2023-09-19T21:33Z 53.3K followers, 64.1K engagements
"M2 Ultra serving Q8_0 LLaMA-v2 70B to [--] clients in parallel"
X Link 2023-10-08T19:40Z 53.3K followers, 109.8K engagements
"π What is this black magic"
X Link 2023-10-10T17:30Z 53.3K followers, 310.2K engagements
"Serving [--] clients in parallel on A100 with llama.cpp Model: Codellama 7B F16 System prompt: [---] tokens Requests: [---] Max sequence length: [---] Continuous batching: enabled Average speed [---] t/s (including prompts and generated tokens)"
X Link 2023-10-24T08:46Z 53.3K followers, 93.1K engagements
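For context, parallel serving in llama-server hinges on two flags: -np (number of parallel sequences) and -cb (continuous batching), with the context split across slots. A hedged sketch of a comparable setup (model path and sizes are illustrative):

llama-server -m codellama-7b-f16.gguf -np 32 -cb -c 16384 -ngl 99

With -np 32 and -c 16384, each slot gets roughly 512 tokens of context, so the total context has to be sized for the expected concurrency.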
"llama.cpp is standing ground against the behemoths The CUDA backend is contained in a single C++ file so it allows for very easy deployment and custom modifications (pp - prefill tg - text gen) Trying out the new TensorRT-LLM framework and get some pretty good performance out of the box with 3090s. [---] tokens/sec int8 and [--] tok/sec bf16 for llama-2 7B models (not much work to setup either) Get 160+ tokens/sec on 2x3090s (these are just batch_size=1) https://t.co/1pnVTdFlbX Trying out the new TensorRT-LLM framework and get some pretty good performance out of the box with 3090s. [---] tokens/sec"
X Link 2023-11-02T10:17Z 53.3K followers, 78K engagements
"whisper.cpp v1.5.0 https://github.com/ggerganov/whisper.cpp/releases/tag/v1.5.0 https://github.com/ggerganov/whisper.cpp/releases/tag/v1.5.0"
X Link 2023-11-15T21:13Z 53.3K followers, 82K engagements
"Native whisper.cpp server with OAI-like API is now available $ make server && ./server This is a very convenient way to run an efficient local transcription service locally on any kind of hardware (CPU GPU (CUDA or Metal) or ANE) thx felrock https://github.com/ggerganov/whisper.cpp/pull/1380 https://github.com/ggerganov/whisper.cpp/pull/1380"
X Link 2023-11-21T07:11Z 53.3K followers, 174.2K engagements
"Here is how to deploy and serve any LLM on HF with a single command in less than [--] minutes with llama.cpp $ bash -c "$(curl -s https://ggml.ai/server-llm.sh https://ggml.ai/server-llm.sh"
X Link 2023-11-21T17:54Z 53.3K followers, 144.6K engagements
"Very clever stuff Will be adding a llama.cpp example soon Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step. Blog: https://t.co/J4VvXgaBmm Code: https://t.co/EgYNtFXkjH https://t.co/3eoRGcYf2S Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step. Blog:"
X Link 2023-11-21T22:33Z 53.3K followers, 81.1K engagements
"Wrote a short tutorial for setting up llama.cpp on AWS instances For example you can use one of the cheapest 16GB VRAM (NVIDIA T4) instances to serve a quantum Mistral 7B model to multiple clients in parallel with full context. Hope it is useful https://github.com/ggerganov/llama.cpp/discussions/4225 https://github.com/ggerganov/llama.cpp/discussions/4225"
X Link 2023-11-27T20:13Z 53.3K followers, 207K engagements
"ggml will soon run on billion devices @apple don't sleep on it π I just verified this on my Pixel [--] Pro phone It has AICore included and it is using ggml https://t.co/JeBpSnNIPk I just verified this on my Pixel [--] Pro phone It has AICore included and it is using ggml https://t.co/JeBpSnNIPk"
X Link 2023-12-09T15:28Z 53.3K followers, 234K engagements
"Adding support for the new Mixtral models Runs on CPU CUDA and Metal with quantization support and partial GPU offloading. Very interesting architecture to play with https://github.com/ggerganov/llama.cpp/pull/4406 https://github.com/ggerganov/llama.cpp/pull/4406"
X Link 2023-12-11T14:43Z 53.3K followers, 164.1K engagements
"Running some LLM benches on iPhone [--] Mini This is 1.1B TinyLlama. Speed looks quite reasonable. Wonder what would be some cool applications that we can try out π€ P.S. Forget about useless chat bots - we want something else. Think grammar function calling etc"
X Link 2023-12-15T15:16Z 53.3K followers, 140.6K engagements
"Run @Google's Gemma Open Models with llama.cpp https://github.com/ggerganov/llama.cpp/pull/5631 https://github.com/ggerganov/llama.cpp/pull/5631"
X Link 2024-02-21T13:19Z 53.3K followers, 68.9K engagements
"The GGUF file format is a great example of the cool things that an open-source community can achieve. Props to @philpax_ and everyone else involved in the design and implementation of the format. I'm thankful and happy to see that it finds adoption in ML https://github.com/ggerganov/ggml/pull/302 At @huggingface we are adding more support to GGUF (model format by @ggerganov). The number of GGUF models on the hub has been exploding & doesn't look like it is gonna slow downπ₯ see more at: https://t.co/eZCCUsevEn https://t.co/0tRE2z3Xsb https://github.com/ggerganov/ggml/pull/302 At @huggingface"
X Link 2024-03-15T11:54Z 53.3K followers, 74.1K engagements
"Causally running Grok-1 at home"
X Link 2024-03-22T20:31Z 53.3K followers, 279.1K engagements
"Challenge accepted π Achievement unlocked: [---] tokens-per-sec 4-bit Mistral 7B in MLX on an M2 Ultra https://t.co/nmNz5Kph2P Achievement unlocked: [---] tokens-per-sec 4-bit Mistral 7B in MLX on an M2 Ultra https://t.co/nmNz5Kph2P"
X Link 2024-04-04T16:19Z 53.3K followers, 104.6K engagements
"GGUF My Repo by @huggingface Create quantum GGUF models fully online - quickly and secure. Thanks to @reach_vb @pcuenq and team for creating this HF space In the video below I give it a try to create a quantum 8-bit model of Gemma 2B - it took about [--] seconds. The resulting model becomes automatically available in your HF profile and is ready to be used with llama.cpp https://huggingface.co/spaces/ggml-org/gguf-my-repo https://huggingface.co/spaces/ggml-org/gguf-my-repo"
X Link 2024-04-05T17:48Z 53.3K followers, 65.3K engagements
"llama.cpp is now in Homebrew Core πΊ"
X Link 2024-05-28T18:38Z 53.3K followers, 200.8K engagements
"LBDL + llama for scale. thx @francoisfleuret"
X Link 2024-05-31T19:20Z 53.3K followers, 38.7K engagements
"Llama [---] 3B & 1B GGUF https://huggingface.co/collections/hugging-quants/llama-32-3b-and-1b-gguf-quants-66f43204a559009763c009a5 https://huggingface.co/collections/hugging-quants/llama-32-3b-and-1b-gguf-quants-66f43204a559009763c009a5"
X Link 2024-09-25T17:50Z 53.3K followers, 94.8K engagements
"llama.vim : Neovim plugin for local text completion (powered by llama.cpp)"
X Link 2024-10-21T15:05Z 53.3K followers, 66.1K engagements
"ggml inference tech making its way into this weeks @apple M4 announcements is a great testament to this. IMO Apple Silicon continues to be the best consumer-grade hardware for local AI applications. For next year they should move copilot on-device. The future of on-device inference is ggml + Apple Silicon You heard it here first The future of on-device inference is ggml + Apple Silicon You heard it here first"
X Link 2024-11-02T07:58Z 53.3K followers, 59.9K engagements
"Open the pod bay doors HAL"
X Link 2024-12-19T18:39Z 53.3K followers, 44.4K engagements
"Make your Mac think π§ Tomorrow I'll show you how to enable speculative decoding for extra speed"
X Link 2025-01-20T20:58Z 53.3K followers, 238.4K engagements
"Make your Mac think faster π§ π§ Tomorrow I'll show you how to cancel your copilot subscription. Make your Mac think π§ Tomorrow I'll show you how to enable speculative decoding for extra speed. https://t.co/Tf3H4DNxoH Make your Mac think π§ Tomorrow I'll show you how to enable speculative decoding for extra speed. https://t.co/Tf3H4DNxoH"
X Link 2025-01-21T16:04Z 53.3K followers, 226K engagements
"llama.vscode (powered by Qwen Coder) Make your Mac think faster π§ π§ Tomorrow I'll show you how to cancel your copilot subscription. https://t.co/HfqdhJ31TB Make your Mac think faster π§ π§ Tomorrow I'll show you how to cancel your copilot subscription. https://t.co/HfqdhJ31TB"
X Link 2025-01-22T17:00Z 53.3K followers, 78.4K engagements
"pack it up boys it's over"
X Link 2025-01-27T14:41Z 53.3K followers, 716.4K engagements
"jk we are just starting https://github.com/ggerganov/llama.cpp/pull/11453 https://github.com/ggerganov/llama.cpp/pull/11453"
X Link 2025-01-27T14:42Z 53.3K followers, 59.4K engagements
"Here is the most cost effective way to deploy (the real) R1 in the cloud - use llama.cpp-powered inference endpoints @huggingface The @UnslothAI quantizations fit neatly in 4x L40S (max [----] ctx for now) Use the following link: https://endpoints.huggingface.co/newrepository=unsloth%2FDeepSeek-R1-GGUF&vendor=aws®ion=us-east-1&accelerator=gpu&instance_id=aws-us-east-1-nvidia-l40s-x4&task=text-generation&no_suggested_compute=true&env_LLAMA_ARG_CACHE_TYPE_K=q8_0&env_LLAMA_ARG_UBATCH=64"
X Link 2025-01-27T19:32Z 53.3K followers, 57.8K engagements
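Note the env_LLAMA_ARG_* parameters in that URL: llama-server can read its settings from LLAMA_ARG_* environment variables, which is how these hosted endpoints are configured. A local equivalent of the two variables above (the model path is illustrative):

LLAMA_ARG_CACHE_TYPE_K=q8_0 LLAMA_ARG_UBATCH=64 llama-server -m model.gguf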
"CPU is all you need Complete hardware + software setup for running Deepseek-R1 locally. The actual model no distillations and Q8 quantization for full quality. Total cost $6000. All download and part links below: Complete hardware + software setup for running Deepseek-R1 locally. The actual model no distillations and Q8 quantization for full quality. Total cost $6000. All download and part links below:"
X Link 2025-01-28T20:53Z 53.3K followers, 39.2K engagements
"DeepSeek-R1 on Mac Studio 192GB πͺ"
X Link 2025-01-28T21:49Z 53.3K followers, 261.6K engagements
"this is what a happy llama.cpp user looks like"
X Link 2025-02-03T19:13Z 53.3K followers, 22.6K engagements
"Today I was sent the following cool demo: Two AI agents on a phone call realize theyre both AI and switch to a superior audio signal ggwave"
X Link 2025-02-24T16:11Z 53.3K followers, 17.6M engagements
"The developers (Anton and Boris) used the ggwave library to make the AIs communicate faster over a phone call https://github.com/ggerganov/ggwave https://github.com/ggerganov/ggwave"
X Link 2025-02-24T16:11Z 53.3K followers, 1.1M engagements
"The "gibberlink" project is hosted here: They won 1st place in a hackaton competition Congrats https://github.com/PennyroyalTea/gibberlink https://github.com/PennyroyalTea/gibberlink"
X Link 2025-02-24T16:11Z 53.3K followers, 594.5K engagements
"You can actually decode the audio messages from the video above using the waver webpage: https://waver.ggerganov.com https://waver.ggerganov.com"
X Link 2025-02-24T18:56Z 53.3K followers, 444.8K engagements
"Here is another Gibberlink experiment: Two AI agents autonomously encrypt their audio chat (video by Anton Pidkuiko)"
X Link 2025-03-03T16:02Z 53.3K followers, 274.2K engagements
"Google is taking local AI to the next level with Gemma [--] QAT GGUF models Uncompromised quality with quantization-aware training and uncompromised on-device performance with ggml. This is the way"
X Link 2025-04-03T18:08Z 53.3K followers, 118.4K engagements
"Retweet if you want llama.cpp added here π In preview Copilot Pro and Copilot Free users can now bring your own key (BYOK) for popular providers such as Anthropic Gemini Ollama and Open Router. This allows you to use new models that arent supported natively by Copilot the very first day that theyre released. https://t.co/nAJy6JzTMJ In preview Copilot Pro and Copilot Free users can now bring your own key (BYOK) for popular providers such as Anthropic Gemini Ollama and Open Router. This allows you to use new models that arent supported natively by Copilot the very first day that theyre"
X Link 2025-04-08T17:59Z 53.3K followers, 48.2K engagements
"New account for ggml news and notable PRs https://t.co/rw5hHy3PQu https://t.co/rw5hHy3PQu"
X Link 2025-04-23T14:16Z 53.3K followers, 58.1K engagements
"Son has been doing an outstanding job at maintaining the llama-server implementation and now bringing full-blown vision input support to llama.cpp Massive kudos and thanks for your valuable contributions to the project Vision support now available on llama.cpp server and Web UI More details in π§΅ https://t.co/X0Iwx1K4MY Vision support now available on llama.cpp server and Web UI More details in π§΅ https://t.co/X0Iwx1K4MY"
X Link 2025-05-10T06:53Z 53.3K followers, 26.8K engagements
"LlamaBarn (sneak peek π)"
X Link 2025-06-23T16:41Z 53.3K followers, 172.7K engagements
"AMD teams contributing to the llama.cpp codebase. Great support from the community with the review process. Exciting to see this open-source collaboration https://t.co/TRutvkkVYU https://t.co/TRutvkkVYU"
X Link 2025-07-28T18:59Z 53.3K followers, 35.4K engagements
"Llama.cpp supports the new gpt-oss model in native MXFP4 format The ggml inference engine (powering llama.cpp) can run the new gpt-oss model with all major backends including CUDA Vulkan Metal and CPU at exceptional performance. This virtually brings the unprecedented quality of gpt-oss in the hands of everyone - from local AI enthusiasts to enterprises doing inference at the edge or in the cloud. The unique inference capabilities of ggml unlock a vast amount of use cases for the entire spectrum of consumer-grade hardware available on the market today - use cases that are impossible to"
X Link 2025-08-05T17:12Z 53.3K followers, 115.1K engagements
"LMStudio are using the upstream ggml implementation which is significantly better and well optimized. Looking at ollama's modifications in ggml they have too much branching in their MXFP4 kernels and the attention sinks implementation is really inefficient. Along with other inefficiencies I expect the performance is going to be quite bad in ollama. Why @ollama gpt-oss:20b version is too slow compared to the LM Studio version Any issue Why @ollama gpt-oss:20b version is too slow compared to the LM Studio version Any issue"
X Link 2025-08-06T13:37Z 53.3K followers, 305.4K engagements
"The ultimate guide for using gpt-oss with llama.cpp - Runs on any device - Supports NVIDIA Apple AMD and others - Support for efficient CPU offloading - The most lightweight inference stack today https://github.com/ggml-org/llama.cpp/discussions/15396 https://github.com/ggml-org/llama.cpp/discussions/15396"
X Link 2025-08-19T15:06Z 53.3K followers, 144.2K engagements
"llama.cpp is on fire today It's amazing to see this open-source collaboration getting stronger every day"
X Link 2025-08-21T16:26Z 53.3K followers, 44.6K engagements
"gpt-oss is a great model IMO OpenAI showed us the blueprint for winning local AI: - Interleaved SWA - Small head sizes in the attention - Attention sinks - Mixture of Experts FFN - 4-bit training All of these parts combined together result in the best architecture suitable for regular users. Very lightweight and efficient for inference on pretty much any hardware. Qwen models are also great. The MoE works really well. I think they should just adopt iSWA and 4-bit training to become the best. Gemma models are also great. They already have the 4-bit QAT figured out. It seems they just need to"
X Link 2025-08-28T14:18Z 53.3K followers, 83.4K engagements
"To run gpt-oss-20b on a 16GB Mac use these commands: brew install llama.cpp llama-server -hf ggml-org/gpt-oss-20b-GGUF --n-cpu-moe [--] -fa -c [-----] --jinja --no-mmap Then open the browser at http://127.0.0.1:8080 Ollama just added a UI to its Mac app so you can run our open weights model while offline - super useful for train and plane rides. I previously used macai for this but this is way simpler to set up. I run gpt-oss:20b model on my MacBook Pro with [--] GB. It's a good model and https://t.co/ZVsoSBN6e0 Ollama just added a UI to its Mac app so you can run our open weights model while"
X Link 2025-08-28T18:37Z 53.3K followers, 91.5K engagements
"VS Code adds support for custom OAI-compatible endpoints This a big win for local AI as it allows us to use any local model provider without vendor lock-in. Big thanks to the VS Code devs and especially @IsidorN for listening to the community feedback and adding this option"
X Link 2025-09-03T15:01Z 53.3K followers, 44.6K engagements
"A detailed look into the new WebUI of llama.cpp"
X Link 2025-11-04T15:14Z 53.3K followers, 230.5K engagements
"Some neat QoL improvements coming to llama.cpp thanks to Johannes Gler https://github.com/ggml-org/llama.cpp/discussions/18049 https://github.com/ggml-org/llama.cpp/discussions/18049"
X Link 2025-12-15T08:33Z 53.3K followers, 13K engagements
"napkin math ahead: - buy [--] mac mini (200GB/s $1.2k each) - run LLAMA_METAL=1 LLAMA_MPI=1 for interleaved pipeline inference - deploy on-premise serve up to [--] clients in parallel at [--] t/s / 4-bit / 7B is this cost efficient energy wise thanks to @stanimirovb for idea"
X Link 2023-07-12T15:45Z 52.3K followers, 160.9K engagements
"The new Qwen with 256k context"
X Link 2025-01-26T22:18Z 44K followers, 28.1K engagements
"Gemma [--] support has been merged in llama.cpp https://github.com/ggml-org/llama.cpp/pull/12343 https://github.com/ggml-org/llama.cpp/pull/12343"
X Link 2025-03-12T09:10Z 53.1K followers, 21.9K engagements
"wllama is the WebAssembly bindings for llama.cpp - created and maintained by @ngxson https://github.com/ngxson/wllama https://github.com/ngxson/wllama"
X Link 2025-08-19T16:38Z 52.3K followers, [----] engagements
"Highlighting project paddler Build and scale your own LLM infrastructure powered by llama.cpp Mateusz and team worked hard over the last year and have significantly improved the project - check it out and let them know your experience https://github.com/intentee/paddler https://github.com/intentee/paddler https://github.com/intentee/paddler https://github.com/intentee/paddler"
X Link 2025-08-21T15:38Z 53.1K followers, 18K engagements
"Here are some long-awaited performance numbers for DGX Spark using llama.cpp https://github.com/ggml-org/llama.cpp/discussions/16578 https://github.com/ggml-org/llama.cpp/discussions/16578"
X Link 2025-10-14T14:32Z 52.3K followers, 40.2K engagements
"simple Important info. The issue in that benchmark seems to be ollama. Native llama.cpp works much better. Not sure how ollama can fail so hard to wrap llama.cpp. The lesson: Dont use ollama. Espacially not for benchmarks. https://t.co/z2Ue90b7ud Important info. The issue in that benchmark seems to be ollama. Native llama.cpp works much better. Not sure how ollama can fail so hard to wrap llama.cpp. The lesson: Dont use ollama. Espacially not for benchmarks. https://t.co/z2Ue90b7ud"
X Link 2025-10-15T15:14Z 52.3K followers, 28.3K engagements
"The new WebUI is also mobile friendly"
X Link 2025-11-04T15:14Z 52.9K followers, [----] engagements
"LlamaBarn v0.10.0 (beta) is out - feedback appreciated"
X Link 2025-11-05T14:07Z 53K followers, 35.1K engagements
"Initial M5 Neural Accelerators support in llama.cpp Enjoy faster TTFT in all ggml-based software (requires macOS Tahoe 26) https://github.com/ggml-org/llama.cpp/pull/16634 https://github.com/ggml-org/llama.cpp/pull/16634"
X Link 2025-11-06T16:40Z 53K followers, 27.8K engagements
"The new Mistral [--] models in llama.cpp"
X Link 2025-12-02T17:09Z 53.3K followers, 26.7K engagements
"We joined forces with NVIDIA to unlock high-speed AI inference on RTX AI PCs and DGX Spark using llama.cpp. The latest Ministral-3B models reach 385+ tok/s on @NVIDIA_AI_PC GeForce RTX [----] systems. Blog: https://developer.nvidia.com/blog/nvidia-accelerated-mistral-3-open-models-deliver-efficiency-accuracy-at-any-scale/ https://developer.nvidia.com/blog/nvidia-accelerated-mistral-3-open-models-deliver-efficiency-accuracy-at-any-scale/"
X Link 2025-12-02T19:01Z 53K followers, 32.1K engagements
"llama-cli -hf org/model"
X Link 2025-12-10T20:41Z 53.3K followers, 52.2K engagements
"In collaboration with NVIDIA the new Nemotron [--] Nano model is fully supported in llama.cpp Nemotron [--] Nano features an efficient hybrid Mamba MoE architecture. It's a promising model suitable for local AI applications on mid-range hardware. The large context window makes it a great choice for a variety of use cases and applications. The efficiency of llama.cpp and the unique context management features of the llama-server tool allows us to deploy and use this model on a wide-range of hardware. With recent code contributions by engineering teams at NVIDIA and open-source collaborators we can"
X Link 2025-12-15T14:33Z 53.3K followers, 27.3K engagements
"Recent contributions by NVIDIA engineers and llama.cpp collaborators resulting in significant performance gains for local AI"
X Link 2026-01-06T06:42Z 53.3K followers, 36K engagements
"The new WebUI in combination with the advanced backend capabilities of llama.cpp delivers the ultimate local AI chat experience It's fast private free and open-source It runs on any hardware - today Huge thanks to the team at @huggingface for initiating leading and supporting this work. Enjoy https://github.com/ggml-org/llama.cpp/discussions/16938 https://github.com/ggml-org/llama.cpp/discussions/16938"
X Link 2025-11-04T15:14Z 53.3K followers, 13K engagements
"Introducing LlamaBarn a tiny macOS menu bar app for running local LLMs Open source built on llama.cpp"
X Link 2026-01-29T16:31Z 53.3K followers, 53K engagements
"With 1x [----] and latest llama.cpp you will currently get north of 280t/s at empty context and 9000t/s prefill. Not sure about 24B Mistral. But one thing to keep in mind is that quantizing a BF16 Mistral to 4-bits is for sure losing a lot of the quality compared to MXFP4 post-training. https://twitter.com/i/web/status/1961111279847743636 https://twitter.com/i/web/status/1961111279847743636"
X Link 2025-08-28T16:58Z 53.2K followers, [----] engagements
"Btw I have some anecdotal evidence that disabling thinking for GLM-4.7-Flash improves performance for agentic coding stuff. Haven't evaluated in detail yet (only opencode) as it takes time but would be interest to know if you give it a try and share your observations. To disable thinking with llama.cpp add this to the llama-server command: --chat-template-kwargs ""enable_thinking": false" Here is my config for reference: https://twitter.com/i/web/status/2016903216093417540 https://twitter.com/i/web/status/2016903216093417540"
X Link 2026-01-29T15:56Z 53.2K followers, [----] engagements
"@erusev @antoniostoilkov Still early feedback welcome Check out LlamaBarn at https://github.com/ggml-org/LlamaBarn https://github.com/ggml-org/LlamaBarn"
X Link 2026-01-29T16:31Z 53.3K followers, [----] engagements
"New account for ggml news and notable PRs https://t.co/rw5hHy3PQu https://t.co/rw5hHy3PQu"
X Link 2025-04-23T14:16Z 53.3K followers, 58.1K engagements
"GLM-4.7-Flash-GGUF is now the most downloaded model on @UnslothAI"
X Link 2026-02-10T12:59Z 51.3K followers, 56.7K engagements
"Qwen3-Coder-Next and Minimax-M2.1 are available on HF inference endpoints with the price of $2.5/hr and $5/hr respectively. With the context fitting supported you can now utilize the largest context length possible for a given hardware. No more manual tuning -c option"
X Link 2026-02-09T16:25Z [----] followers, [----] engagements
"Introducing LlamaBarn a tiny macOS menu bar app for running local LLMs Open source built on llama.cpp"
X Link 2026-01-29T16:31Z 53.3K followers, 53K engagements
"@erusev @antoniostoilkov Still early feedback welcome Check out LlamaBarn at https://github.com/ggml-org/LlamaBarn https://github.com/ggml-org/LlamaBarn"
X Link 2026-01-29T16:31Z 53.3K followers, [----] engagements
"Hugging Face Inference Endpoint now supports deploying GLM-4.7-Flash via llama.cpp for as cheap as $0.8/hr Using Q4_K_M and 24k tokens context length - should be enough for most use case"
X Link 2026-01-26T12:26Z [----] followers, [----] engagements
"π¦llama.cpp supports Anthropic's Messages API for a while now with streaming tool calling and reasoning support. Compatible with Claude Code. See more here: https://huggingface.co/blog/ggml-org/anthropic-messages-api-in-llamacpp https://huggingface.co/blog/ggml-org/anthropic-messages-api-in-llamacpp"
X Link 2026-01-19T17:46Z [----] followers, [----] engagements
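For context, supporting the Messages API means a local llama-server can stand in for Anthropic's endpoint in tools that speak it. A hedged sketch of a request against a local server (default port assumed; the model field is illustrative and typically ignored locally):

curl -s http://127.0.0.1:8080/v1/messages -H "Content-Type: application/json" -d '{"model": "local", "max_tokens": 256, "messages": [{"role": "user", "content": "Hello"}]}'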
"Recent contributions by NVIDIA engineers and llama.cpp collaborators resulting in significant performance gains for local AI"
X Link 2026-01-06T06:42Z 53.3K followers, 36K engagements
"Excited about this LFM2.5-Audio-1.5B Real-time text-to-speech and ASR Running locally on a CPU with llama.cpp Interleave speech and text It's super elegant I'm bullish on local audio models https://t.co/Fw8RWAg4bG LFM2.5-Audio-1.5B Real-time text-to-speech and ASR Running locally on a CPU with llama.cpp Interleave speech and text It's super elegant I'm bullish on local audio models https://t.co/Fw8RWAg4bG"
X Link 2026-01-06T16:15Z 53.3K followers, 17.3K engagements
Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing