The “200B Parameter Cruncher MacBook Pro”: Exploring the M4 Max’s LLM Performance

Sean Vosler
3 min read · Oct 30, 2024



Apple just dropped brand-new MacBook Pros with the claim that, thanks to the new M4 Max chip and 128GB of unified memory, they can “easily interact with LLMs which have 200 billion parameters.”
What does that mean? With the right configuration, you’d be able to load the most demanding LLMs locally (and on the go) and interact with them in all sorts of interesting ways.

Keep in mind that ‘bigger isn’t always better’ when it comes to model parameters; today’s 8B models can often outperform 100B models from just six months ago. Where this kind of horsepower is a game changer is in getting the most out of the context windows of small and medium-sized models. With a reasonably powerful model like Ministral 8B, you could likely take full advantage of its 128k-token context window: include a few hundred pages of text IN the prompt and interact with all of it, in context, at a reasonable rate (tokens/s).
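As a concrete example, here’s a minimal sketch of that workflow using llama-cpp-python (assumptions: a Metal-enabled build, a GGUF quantization of the model, and a placeholder file path):

```python
from llama_cpp import Llama

# Load a local 8B model with a 128k-token context window.
# The path/filename are placeholders; n_gpu_layers=-1 offloads
# every layer to the GPU (Metal on Apple Silicon).
llm = Llama(
    model_path="models/ministral-8b-q4_k_m.gguf",  # hypothetical path
    n_ctx=131072,     # 128k tokens; the KV cache alone eats serious memory
    n_gpu_layers=-1,
)

# Drop a few hundred pages of text straight into the prompt.
with open("big_document.txt") as f:
    document = f.read()

out = llm(
    f"{document}\n\nQuestion: Summarize the key arguments above.\nAnswer:",
    max_tokens=512,
)
print(out["choices"][0]["text"])
```

A 128k context plus the model weights is exactly the kind of unified-memory load that the 128GB configuration is built for.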

In English: local LLMs are going to get a lot more usable, and developers building bleeding-edge software are going to be able to leverage that in very interesting ways!
I’d love to have local knowledge-base management software that used Ministral 8B + Mistral Embed to fully embed my book collection, my article collections, and all my notes, and let me interact with it all FAST. It can be done now, but this new #M4 Max processor is perfect for the task, and you can put it in your backpack. What a time to be alive.
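A minimal sketch of what that retrieval layer could look like, using a small local embedding model as a stand-in for Mistral Embed (the model name and toy data are illustrative, not a recommendation):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any local embedding model works here; a hosted mistral-embed call
# would be a drop-in replacement if you don't mind leaving the laptop.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> np.ndarray:
    """Embed every chunk of the library once, up front."""
    return np.asarray(embedder.encode(chunks, normalize_embeddings=True))

def search(query: str, chunks: list[str], index: np.ndarray, k: int = 5):
    """Vectors are normalized, so cosine similarity is just a dot product."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

chunks = ["...your books, articles, and notes, pre-chunked..."]
index = build_index(chunks)
for text, score in search("what did I note about context windows?", chunks, index):
    print(f"{score:.3f}  {text[:80]}")
```

The retrieved chunks then get stuffed into the 8B model’s long context, which is where the M4 Max’s memory bandwidth earns its keep.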

Apple’s new MacBook Pro features the incredibly powerful M4 family of chips

Running 200B-parameter LLMs locally “easily” is a huge claim to anyone who has tried it on a computer without a beefy GPU.

Some nerdy specs to consider…

Tokens Processed Per Second

Average speed (tokens/s) of generating 1,024 tokens with a LLaMA 3 8B Q4_K_M model:
- M1 Max: 34.49 tokens/s
- M2 Ultra: 76.28 tokens/s
- M3 Max: 50.74 tokens/s

M4 Max, projected (napkin math):
- Text generation: ~96–100 tokens/s
- Prompt processing: ~1,200–1,300 tokens/s
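Those measured figures come from community llama.cpp benchmarks. If you want to see where your own machine lands, a rough timing harness (model path is a placeholder) looks like this:

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-q4_k_m.gguf",  # hypothetical path
    n_gpu_layers=-1,
    verbose=False,
)

start = time.perf_counter()
out = llm("Write a long essay about the history of computing.", max_tokens=1024)
elapsed = time.perf_counter() - start

# The model may stop early at an end-of-sequence token, so count
# what was actually generated rather than assuming 1024.
generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tokens/s")
```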

The M4 Max projection above is based on:
- The 1.9x GPU performance improvement over M1 Max
- The increased memory bandwidth (546 GB/s)
- The 3x faster Neural Engine
- Historical scaling patterns from M1→M2→M3
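One way to sanity-check the text-generation figure: single-stream generation is typically memory-bandwidth bound, so tokens/s can’t exceed memory bandwidth divided by the bytes read per token (roughly the size of the quantized model on disk). Assuming ~4.9GB for an 8B Q4_K_M file:

```python
# Napkin math: token generation is roughly memory-bandwidth bound.
bandwidth_gb_s = 546   # M4 Max unified memory bandwidth
model_size_gb = 4.9    # approx. size of an 8B Q4_K_M GGUF (assumption)

ceiling = bandwidth_gb_s / model_size_gb
print(f"Theoretical ceiling: {ceiling:.0f} tokens/s")  # ~111 tokens/s

projected = 96.41
print(f"Projection is {projected / ceiling:.0%} of that ceiling")  # ~87%
```

The projection sits under the theoretical ceiling, so it’s at least physically plausible, even if real-world efficiency varies from chip to chip.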

How the M4 Max Fits into the GPU Landscape:

Keep in mind that we’re comparing a LAPTOP against processors/GPUs that cost as much as, and sometimes far more than, the entire system.

For text generation, the projected ~96.41 tokens/s would place the M4 Max at about:

  • 67% of H100 PCIe performance (144.49 tokens/s)
  • 75% of RTX 4090 performance (127.74 tokens/s)
  • 26% faster than M2 Ultra (76.28 tokens/s)
  • ~180% improvement over M1 Max (34.49 tokens/s)

For prompt processing, the projected ~1,300 tokens/s would represent:

  • ~17% of H100 PCIe performance (7,760 tokens/s)
  • ~19% of RTX 4090 performance (6,898 tokens/s)
  • ~26% improvement over M2 Ultra (1,023 tokens/s)
  • ~262% improvement over M1 Max (355 tokens/s)
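Spelling out the text-generation comparison, those percentages are just ratios of the benchmark numbers:

```python
# Relative text-generation performance (tokens/s)
m4_max, h100, rtx4090, m2_ultra, m1_max = 96.41, 144.49, 127.74, 76.28, 34.49

print(f"vs H100 PCIe: {m4_max / h100:.0%}")           # ~67%
print(f"vs RTX 4090:  {m4_max / rtx4090:.0%}")        # ~75%
print(f"vs M2 Ultra:  {m4_max / m2_ultra - 1:+.0%}")  # +26%
print(f"vs M1 Max:    {m4_max / m1_max - 1:+.0%}")    # +180%
```

The prompt-processing gap is much wider because that phase is compute bound rather than bandwidth bound, which is exactly where dedicated GPUs like the H100 pull away.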

Now, the real question: with Thunderbolt 5 and NVIDIA’s 5090 series on the horizon, what kind of madness could we see from external GPUs, if the software support ever materializes?

What do you think? At a $3,199 starting price for the MacBook Pro with M4 Max, does this sort of LLM performance make a difference in your buying decision?

Written by Sean Vosler

Author 7 Figure Marketing Copy & Affiliate Manager @ Jasper.ai
