Boosting Large Language Model Performance with Speculative Decoding (Guess‑and‑Check)

YouTube video ID: qmAbco38pXA

Source: YouTube video by Alex Ziskind

Introduction

The video demonstrates how to dramatically speed up inference of massive LLMs (e.g., 70–72 B‑parameter models such as Meta Llama 3.1 70B) on a single Mac using speculative decoding – a technique the author renames “guess‑and‑check”. A small “draft” model quickly predicts upcoming tokens; the large “target” model then verifies those predictions. When a guess is accepted, the target model emits the token without having to generate it one step at a time, multiplying throughput.

How Speculative Decoding Works

  • Draft model: a lightweight model (1‑7 B parameters) runs fast and proposes the next token.
  • Target model: the heavyweight model (14‑72 B) checks the draft’s token. If accepted, the token is emitted without full computation.
  • Compatibility: Draft and target must share the same tokenizer/vocabulary (e.g., all Qwen 2.5 variants or the Llama 3.1 family).
  • Visualization: LM Studio shows “draft tokens accepted” and can color‑code correct guesses.
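The draft/verify loop above can be sketched as a toy routine. The two “models” below are hypothetical stand‑in functions, not real LLMs; a real implementation compares token probabilities and verifies the whole drafted block in a single batched forward pass of the target:

```python
# Toy "guess-and-check" loop. draft_model and target_model are
# illustrative stand-ins for a small and a large LLM.

def draft_model(ctx):
    # Cheap guesser: assumes the sequence always counts up by one.
    return ctx[-1] + 1

def target_model(ctx):
    # Expensive "ground truth": counts up, but wraps to 0 after 4.
    return 0 if ctx[-1] % 5 == 4 else ctx[-1] + 1

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    accepted = 0
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft proposes up to k tokens autoregressively (cheap).
        block, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            block.append(t)
            ctx.append(t)
        # 2) Target verifies the block; keep guesses up to the first mismatch.
        for guess in block:
            truth = target_model(out)
            if guess == truth:
                accepted += 1
            out.append(truth)  # the emitted token is always the target's choice
            if guess != truth or len(out) - len(prompt) >= n_tokens:
                break
    return out[len(prompt):], accepted
```

Because every emitted token comes from the target’s verdict, the output matches target‑only decoding exactly; the draft only changes the speed, never the answer.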

Tooling

  • LM Studio: UI to enable speculative decoding, select draft models, and view stats (tokens / sec, accepted draft tokens).
  • llama.cpp & vLLM: CLI/server‑based runtimes that also support the technique.
  • Draftbench (GitHub): Open‑source benchmark that sweeps combinations of target‑draft pairs, measures speed‑up, and reports the optimal pairing. It automates the otherwise tedious manual testing.

Choosing the Right Draft Model

The author ran exhaustive tests on an M3 Ultra Mac Studio and an M1 Max MacBook Pro. Key findings:

  • 72 B target (Q8 quant): baseline ≈ 8.7 tps; best speed with a 1.5 B draft → 27.6 tps (≈ 216 % boost). A 0.5 B draft was also strong (25.2 tps), and a 7 B draft gave 26.2 tps – good but not optimal.
  • 14 B target (FP16): baseline ≈ 22 tps; with a 1.5 B draft → 72 tps (≈ 216 % boost). Quantized versions (Q8, Q4KM, Q4) also improve, but FP16 + draft yields the highest quality.
  • 7 B target: modest gains; the FP16 version benefits most, while heavily quantized drafts (Q2K, Q3KM) degrade quality.
  • 32 B target: similar pattern; any compatible draft improves throughput, and the sweet spot remains around 1–1.5 B drafts.
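As a quick arithmetic check, the percentage boosts quoted above follow directly from the reported tokens‑per‑second figures (the baselines are approximate, which accounts for small rounding differences):

```python
def boost_pct(baseline_tps, assisted_tps):
    """Percentage throughput gain of draft-assisted over baseline decoding."""
    return (assisted_tps - baseline_tps) / baseline_tps * 100

# 72 B target at Q8: 8.7 tps alone vs 27.6 tps with the 1.5 B draft.
print(round(boost_pct(8.7, 27.6)))   # 217, i.e. the quoted ~216 % boost
print(round(boost_pct(8.7, 25.2)))   # 190 for the 0.5 B draft
```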

Quantization Impact

  • Higher‑precision (FP16, Q8) models retain quality but run slower.
  • Lower‑precision (Q4, Q4KM, Q4_0) run faster but may lose some answer fidelity.
  • Speculative decoding lets you keep a high‑quality target (FP16/Q8) while regaining speed via a tiny draft.

Practical Workflow

  1. Select target model (size & quantization) based on hardware memory.
  2. Pick a draft model that shares the tokenizer (usually same family, smaller size).
  3. Enable speculative decoding in LM Studio, or supply a draft model on the llama.cpp command line (e.g., via --model-draft).
  4. Run Draftbench to benchmark all draft‑target combos you care about.
  5. Deploy the best pair for daily inference; monitor accepted‑draft ratio to ensure quality.
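As a concrete illustration of steps 2–3, a llama.cpp server invocation might look like the following. The model filenames are hypothetical, and flag names vary between llama.cpp versions (recent builds use `-md`/`--model-draft` and `--draft-max`), so check `llama-server --help` for your build:

```shell
# Hypothetical GGUF filenames; both models are from the same family
# so they share a tokenizer/vocabulary.
llama-server \
  -m  qwen2.5-14b-instruct-q8_0.gguf \
  -md qwen2.5-1.5b-instruct-q4_k_m.gguf \
  --draft-max 16
```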

Additional Resources Mentioned

  • boot.dev: an RPG‑style platform for learning back‑end development (Python, Go, JavaScript) with AI‑assisted hints. Free lesson browsing; paid membership unlocks full features.
  • GitHub – Draftbench: repository containing the benchmarking script and result visualizations.

Results Summary (Heat‑Map Insight)

  • Green cells = significant speed‑up (often > 150 %).
  • Red cells = slowdown (e.g., overly large drafts or overly quantized targets).
  • The most consistent winners: 1.5 B and 0.5 B drafts for 14‑72 B targets.
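This green/red pattern matches the standard speculative‑decoding arithmetic: with acceptance rate α per drafted token and k drafted tokens per verification pass, the target emits on average (1 − α^(k+1)) / (1 − α) tokens per pass. A minimal sketch, where the draft‑to‑target cost ratio c is an assumed parameter:

```python
def expected_tokens_per_pass(alpha, k):
    """Mean tokens emitted per target pass: (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def approx_speedup(alpha, k, c):
    """Rough wall-clock speedup; c = per-token draft cost / target cost."""
    return expected_tokens_per_pass(alpha, k) / (1 + c * k)

# Small same-family draft (high acceptance, cheap): a big green cell.
print(round(approx_speedup(alpha=0.8, k=4, c=0.05), 2))  # 2.8
# Poorly matched draft (low acceptance, expensive): a red cell (< 1x).
print(round(approx_speedup(alpha=0.3, k=4, c=0.5), 2))   # 0.48
```

High acceptance rates only come from drafts that share the target’s tokenizer and predict it well, which is why the tiny same‑family 0.5–1.5 B drafts dominate the green cells.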

Conclusion

Speculative decoding (guess‑and‑check) transforms otherwise unusable large‑model inference into a practical, high‑throughput solution on consumer‑grade hardware. By pairing a tiny, fast draft model with a high‑quality target model and using tools like LM Studio or Draftbench, you can achieve 2‑3× speed‑ups without sacrificing answer quality.

Speculative decoding lets you keep the accuracy of massive LLMs while gaining 2‑3× faster generation by intelligently pairing them with tiny draft models—a game‑changer for running large models on a single workstation.
