The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago. That's not "nothing" and is plenty good enough for everyday work.
> The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago
Kimi K2.5 is a trillion-parameter model. You can't run it locally on anything other than extremely well-equipped hardware. Even heavily quantized you'd still need 512GB of unified memory, and the quantization would impact the performance.
Also the proprietary models a year ago were not that good for anything beyond basic tasks.
Most benchmarks show very little improvement of "full quality" over a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and get 92-95% of the same performance, with less VRAM use.
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.
Thank you. Could you give a tl;dr on "the full model needs ____ this much VRAM and if you do _____ the most common quantization method it will run in ____ this much VRAM" rough estimate please?
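A rough back-of-the-envelope sketch of that tl;dr, assuming weights dominate memory use and ignoring the KV cache, activations, and any per-block scale overhead that real quantization formats add. The function names are made up for illustration, and the parameter counts are just the sizes mentioned in this thread:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def reduction_pct(bits_from: float, bits_to: float) -> float:
    """Percentage of weight memory saved by moving to a lower bit width."""
    return (1 - bits_to / bits_from) * 100


if __name__ == "__main__":
    # Illustrative sizes only: a 70B dense model and a ~1T-parameter model.
    for name, n_params in [("70B dense", 70e9), ("~1T params", 1e12)]:
        for bits in (16, 8, 4):
            gb = weight_memory_gb(n_params, bits)
            print(f"{name:>12} @ {bits:>2}-bit: {gb:7.0f} GB")
    print(f"f16 -> 4-bit saves {reduction_pct(16, 4):.1f}% of weight memory")
    print(f"f32 -> 4-bit saves {reduction_pct(32, 4):.1f}% of weight memory")
```

By this arithmetic a 4-bit quant of an f16 model needs about a quarter of the weight memory (a 75% saving, or 87.5% from f32), and a trillion parameters at 4 bits is about 500GB before any context. Actual files tend to come out somewhat larger, since quantization formats store scales alongside the weights.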
Depending on what your usage requirements are, Mac Minis running UMA over RDMA is becoming a feasible option. At roughly 1/10 of the cost you're getting much much more than 1/10 the performance. (YMMV)
I did not expect this to be a limiting factor in the Mac Mini RDMA setup:
> Thermal throttling: Thunderbolt 5 cables get hot under sustained 15GB/s load. After 10 minutes, bandwidth drops to 12GB/s. After 20 minutes, 10GB/s. Your 5.36 tokens/sec becomes 4.1 tokens/sec. Active cooling on cables helps but you’re fighting physics.
Thermal throttling of network cables is a new thing to me…
I admire the patience of anyone who runs dense models on unified memory. Personally, I would rather feed an entire programming book or code directory to a sparse model and get an answer in 30 seconds, and then use the cloud in the rare cases where that's not enough.
70B dense models are way behind SOTA. Even the aforementioned Kimi 2.5 has fewer active parameters than that, and it is quantized to int4 on top of that. We're at a point where some near-frontier models may run out of the box on Mac Mini-grade hardware, with perhaps no real need to even upgrade to the Mac Studio.
> Heck look at /r/locallama/ There is a reason its entirely Nvidia.
That's simply not true. Nvidia may be relatively popular, but people use all sorts of hardware there. Just a couple of random, recent self-reported hardware setups from comments:
You have a point that at scale everybody except maybe Google is using Nvidia. But r/locallama is not your evidence of that, unless you apply your priors, filter out all the hardware that doesn't fit your so-called "hypotheticals and 'testing grade'" criteria, and engage in circular logic.
PS: In fact r/locallama does not even cover your "real world use". Most mentions of Nvidia are from people who have older GPUs, e.g. 3090s, lying around, or who are looking at the Chinese VRAM mods to allow them to run larger models. Nobody is discussing how to run a cluster of H200s there.
Mmmm, not really. I have both a 4x 3090 box and an M1 Mac with 64GB. I find that the Mac performs about the same as 2x 3090s. That’s nothing stellar, but you can run 70B models at decent quants with moderate context windows. Definitely useful for a lot of stuff.
Really had to modify the problem to make it seem equal? Not that quants are that bad, but the context windows thing is the difference between useful and not useful.
Not at all. I don't even know why someone would be incentivized by promoting Nvidia outside of holding large amounts of stock. Although, I did stick my neck out suggesting we buy A6000s after the Apple M series didn't work. To 0 people's surprise, the 2xA6000s did work.
It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.
The full model is supposedly comparable to Sonnet 4.5. But you can run the 4-bit quant on consumer hardware as long as your RAM + VRAM has room to hold 46GB. The 8-bit quant needs 85GB.
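To make the "RAM + VRAM has room" check concrete, here is a minimal sketch using the 46GB and 85GB figures quoted above. The `fits_in_memory` helper and its `headroom_gb` parameter are hypothetical, and the example machine (24GB of VRAM plus 32GB of system RAM) is an assumption for illustration, not a recommendation:

```python
def fits_in_memory(model_gb: float, ram_gb: float, vram_gb: float,
                   headroom_gb: float = 0.0) -> bool:
    """Return True if the model fits in combined RAM + VRAM.

    headroom_gb is memory reserved for the OS, KV cache, and other
    processes; its value here is an assumption the user must choose.
    """
    return ram_gb + vram_gb - headroom_gb >= model_gb


if __name__ == "__main__":
    ram_gb, vram_gb = 32.0, 24.0  # hypothetical consumer machine
    for label, size_gb in [("4-bit", 46.0), ("8-bit", 85.0)]:
        ok = fits_in_memory(size_gb, ram_gb, vram_gb)
        print(f"{label} quant ({size_gb:.0f} GB): {'fits' if ok else 'does not fit'}")
```

On paper the 4-bit quant squeezes into that machine and the 8-bit one does not. Once you reserve realistic headroom for the OS and context, though, the 4-bit case gets tight: with 12GB reserved, only 44GB is left.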
I've never heard of this guy before, but I see he's got 5M YouTube subscribers, which I guess is the clout you need to have Apple loan (I assume) you $50K worth of Mac Studios!
It'll be interesting to see how model sizes, capability, and local compute prices evolve.
A bit off topic, but I was in Best Buy the other day and was shocked to see 65" TVs selling for $300 ... I can remember the first large flat screen TVs (plasma?) selling for 100x that ($30K) when they first came out.
Kimi K2.5 is fourth place for intelligence right now. And it's not as good as the top frontier models at coding, but it's better than Claude 4.5 Sonnet. https://artificialanalysis.ai/models