Which takes a $20k thunderbolt cluster of 2 512GB RAM Mac Studio Ultras to run a...

0xbadcafebee · 2026-02-05T00:46:00 1770252360

Most benchmarks show very little improvement of "full quality" over a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

MuffinFlavored · 2026-02-05T01:23:17 1770254597

> You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

Are there a lot of options how "how far" do you quantize? How much VRAM does it take to get the 92-95% you are speaking of?

bigyabai · 2026-02-05T01:33:54 1770255234

> Are there a lot of options how "how far" do you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.

MuffinFlavored · 2026-02-05T02:01:59 1770256919

Thank you. Could you give a tl;dr on "the full model needs ____ this much VRAM and if you do _____ the most common quantization method it will run in ____ this much VRAM" rough estimate please?

omneity · 2026-02-05T05:07:23 1770268043

It’s a trivial calculation to make (+/- 10%).

Number of params == “variables” in memory

VRAM footprint ~= number of params * size of a param

A 4B model at 8 bits will result in 4GB vram give or take, same as params. At 4 bits ~= 2GB and so on. Kimi is about 512GB at 4 bits.

polynomial · 2026-02-05T06:23:52 1770272632

Depending on what your usage requirements are, Mac Minis running UMA over RDMA is becoming a feasible option. At roughly 1/10 of the cost you're getting much much more than 1/10 the performance. (YMMV)

https://buildai.substack.com/i/181542049/the-mac-mini-moment

danw1979 · 2026-02-05T12:26:54 1770294414

I did not expect this to be a limiting factor in the mac mini RDMA setup ! -

> Thermal throttling: Thunderbolt 5 cables get hot under sustained 15GB/s load. After 10 minutes, bandwidth drops to 12GB/s. After 20 minutes, 10GB/s. Your 5.36 tokens/sec becomes 4.1 tokens/sec. Active cooling on cables helps but you’re fighting physics.

Thermal throttling of network cables is a new thing to me…

cat_plus_plus · 2026-02-05T15:48:36 1770306516

I admire patience of anyone who runs dense models on unified memory. Personally, I would rather feed an entire programming book or code directory to a sparse model and get an answer in 30 seconds and then use cloud in rare cases it's not enough.

polynomial · 2026-02-05T17:26:43 1770312403

Luckily we're having a record cold winter and your setup can double as a personal space heater.

deaux · 2026-02-05T02:59:05 1770260345

And that's at unusable speeds - it takes about triple that amount to run it decently fast at int4.

Now as the other replies say, you should very likely run a quantized version anyway.

bigyabai · 2026-02-05T00:23:00 1770250980

"Full quality" being a relative assessment, here. You're still deeply compute constrained, that machine would crawl at longer contexts.

PlatoIsADisease · 2026-02-04T23:46:11 1770248771

[flagged]

zozbot234 · 2026-02-04T23:53:08 1770249188

70B dense models are way behind SOTA. Even the aforementioned Kimi 2.5 has fewer active parameters than that, and then quantized at int4. We're at a point where some near-frontier models may run out of the box on Mac Mini-grade hardware, with perhaps no real need to even upgrade to the Mac Studio.

PlatoIsADisease · 2026-02-05T00:00:09 1770249609

>may

I'm completely over these hypotheticals and 'testing grade'.

I know Nvidia VRAM works, not some marketing about 'integrated ram'. Heck look at /r/locallama/ There is a reason its entirely Nvidia.

hnfong · 2026-02-05T02:09:16 1770257356

> Heck look at /r/locallama/ There is a reason its entirely Nvidia.

That's simply not true. NVidia may be relatively popular, but people use all sorts of hardware there. Just a random couple of recent self-reported hardware from comments:

- https://www.reddit.com/r/LocalLLaMA/comments/1qw15gl/comment...

- https://www.reddit.com/r/LocalLLaMA/comments/1qw0ogw/analysi...

- https://www.reddit.com/r/LocalLLaMA/comments/1qvwi21/need_he...

- https://www.reddit.com/r/LocalLLaMA/comments/1qvvf8y/demysti...

PlatoIsADisease · 2026-02-05T12:00:43 1770292843

I specifically mentioned "hypotheticals and 'testing grade'."

Then you sent over links describing such.

In real world use, Nvidia is probably over 90%.

hnfong · 2026-02-05T16:40:39 1770309639

r/locallamma/ is not entirely Nvidia.

You have a point that at scale everybody except maybe Google is using Nvidia. But r/locallama is not your evidence of that, unless you apply your priors, filter out all the hardware that don't fit your so called "hypotheticals and 'testing grade'" criteria, and engage in circular logic.

PS: In fact locallamma does not even cover your "real world use". Most mentions of Nvidia are people who have older GPUs eg. 3090s lying around, or are looking at the Chinese VRAM mods to allow them run larger models. Nobody is discussing how to run a cluster of H200s there.

K0balt · 2026-02-05T01:44:32 1770255872

Mmmm, not really. I have both a4x 3090 box and a Mac m1 with 64 gb. I find that the Mac performs about the same as a 2x 3090. That’s nothing stellar, but you can run 70b models at decent quants with moderate context windows. Definitely useful for a lot of stuff.

PlatoIsADisease · 2026-02-05T11:58:09 1770292689

>quants

>moderate context windows

Really had to modify the problem to make it seem equal? Not that quants are that bad, but the context windows thing is the difference between useful and not useful.

K0balt · 2026-02-06T19:08:19 1770404899

Equal to the 2x3090? Yeah it’s about equal in every way, context windows included.

As for useful at that scale?

I use mine for coding a fair bit, and I don’t find it a detractor overall. It enforces proper API discipline, modularity, and hierarchal abstraction. Perhaps the field of application makes that more important though. (Writing firmware and hardware drivers).

It also brings the advantage of focusing exclusively on the problems that are presented in the limited context, and not wandering off on side quests that it makes up.

I find it works well up to about 1KLOC at a time.

I wouldn’t imply they were equal to commercial models, but I would definitely say that local models are very useful tools.

They are also stable, which is not something I can say for SOTA models. You cal learn how to get the best results from a model and the ground doesn’t move underneath you just when you’re on a roll.

sealeck · 2026-02-05T00:03:18 1770249798

Are you an NVIDIA fanboy?

This is a _remarkably_ aggressive comment!

PlatoIsADisease · 2026-02-05T01:11:06 1770253866

Not at all. I don't even know why someone would be incentivized by promoting Nvidia outside of holding large amounts of stock. Although, I did stick my neck out suggesting we buy A6000s after the Apple M series didn't work. To 0 people's surprise, the 2xA6000s did work.

teaearlgraycold · 2026-02-04T22:13:37 1770243217

Which while expensive is dirt cheap compared to a comparable NVidia or AMD system.

SchemaLoad · 2026-02-04T22:39:29 1770244769

It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.

whatsupdog · 2026-02-05T03:23:38 1770261818

I wonder if the "distributed AI computing" touted by some of the new crypto projects [0] works and is relatively cheaper.

0. https://www.daifi.ai/

cactusplant7374 · 2026-02-04T23:01:22 1770246082

Inference is profitable. Maybe we hit a limit and we don't need as many expensive training runs in the future.

paxys · 2026-02-05T00:18:23 1770250703

Inference APIs are probably profitable, but I doubt the $20-$100 monthly plans are.

cactusplant7374 · 2026-02-05T22:47:31 1770331651

I wouldn’t be so sure. Most users aren’t going to use up their quota every week.

teaearlgraycold · 2026-02-05T00:08:37 1770250117

For sure Claude Code isn’t profitable

bdangubic · 2026-02-05T00:50:04 1770252604

Neither was Uber and … and …

plagiarist · 2026-02-05T01:50:19 1770256219

Businesses will desire me for my insomnia once Anthropics starts charging congestion pricing.

bdangubic · 2026-02-05T13:32:52 1770298372

that is coming for sure to replace the "500" errors

blharr · 2026-02-04T22:35:19 1770244519

What speed are you getting at that level of hardware though?