

From Precision to Quantization: A Practical Guide to Faster, Cheaper LLMs
Published on 2026.01.13 by DeepInfra

Large language models live and die by numbers—literally trillions of them. How finely we store those numbers (their precision) determines how much memory a model needs, how fast it runs, and sometimes how good its answers are. This article walks from the basics to the deep end: we’ll start with how computers even store a number, then show how different precision modes (bf16, fp8, int8, int4) affect weights, activations, and the KV cache in real LLM systems, and what that means for model quality.

1) What does precision mean?

Computers don’t write numbers the way we do. They encode them in bits (zeros and ones), and for most real-world decimals, they rely on a standard called floating-point. If you’ve ever written a number in scientific notation—like 3,200 = 3.2 × 10³—you already know the idea. Floating-point just does the same trick in base 2 instead of base 10.

A floating-point number is split into three parts:

  1. Sign bit – one bit that says whether the number is positive or negative.
  2. Exponent – a few bits that set the power of two (the “2^something” part).
  3. Mantissa (also called the significand or fraction) – the bits that capture the digits of the number itself (the part in front of the exponent).

Put together, a float behaves like:

value = (−1)^sign × 1.mantissa × 2^(exponent − bias)

The bias is just a fixed offset so that we can represent both very small and very large exponents with unsigned bits.

Take the decimal 6.5. In base 2, that’s 110.1, which in base-2 scientific notation becomes 1.101 × 2². Broken into the three fields:

  • Sign = 0 (it’s positive).
  • Exponent = 2 (plus the bias, depending on the format).
  • Mantissa = the digits after the binary point: 101, followed by zeros to fill out the mantissa field.
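
If you want to verify those three fields yourself, here is a quick Python check (an illustration added for this walkthrough, not a required step) that unpacks the fp32 bit pattern of 6.5:

import struct

# Pack 6.5 as an IEEE-754 fp32 and pull out its 32 bits.
bits = struct.unpack(">I", struct.pack(">f", 6.5))[0]

sign     = bits >> 31             # 1 sign bit
exponent = (bits >> 23) & 0xFF    # 8 exponent bits, stored with a bias of 127
mantissa = bits & 0x7FFFFF        # 23 mantissa bits (the digits after the leading 1)

print(sign)                # 0 -> positive
print(exponent - 127)      # 2 -> the power of two
print(f"{mantissa:023b}")  # 10100000000000000000000 -> "101" padded with zeros
# Reconstruct the value from the fields: (-1)^sign * 1.mantissa * 2^(exponent - bias)
print((-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127))   # 6.5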

https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4783fd02-a138-40c7-82c7-79dd05a179e4_1472x772.png 

Different floating-point formats allocate different numbers of bits to these three fields. fp32 uses 1 bit for sign, 8 for exponent, and 23 for mantissa; fp16 trims both fields (1/5/10), while bf16 keeps the fp32-sized exponent but trims the mantissa (1/8/7); fp8 shrinks them even more. Fewer mantissa bits mean coarser “ruler marks,” and fewer exponent bits mean a narrower dynamic range. That’s why lower-bit formats are smaller and faster—but also why they can lose detail if you push them too far.
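
To see the range-vs-resolution trade concretely, PyTorch’s torch.finfo reports the largest representable value (set by the exponent bits) and the spacing near 1.0 (set by the mantissa bits). A small sketch for the common formats:

import torch

# Compare dynamic range (max) and resolution (eps) across float formats.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} max={info.max:.2e}  eps={info.eps:.2e}")

# torch.float32   max=3.40e+38  eps=1.19e-07
# torch.float16   max=6.55e+04  eps=9.77e-04   <- narrow exponent, limited range
# torch.bfloat16  max=3.39e+38  eps=7.81e-03   <- fp32-sized exponent, coarser mantissa
# Recent PyTorch builds also expose fp8 dtypes (e.g. torch.float8_e4m3fn) that
# shrink both fields further.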

https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eafac2a-d027-4d66-95de-7030e0392b39_1796x940.png 

2) Why is precision crucial for LLMs, and what is the problem?

Large Language Models (LLMs) such as GPT, Claude, or LLaMA consist of billions of parameters. Each parameter is a small number that contributes to how the model predicts the next token. When stored as 32-bit floats, these parameters consume enormous amounts of memory—hundreds of gigabytes for state-of-the-art models. As a result, precision becomes not just a theoretical detail but a core design concern that affects every aspect of model performance, scalability, and cost.

The problem is that there is an inherent trade-off. Reducing the bit-width means fewer bytes must be moved and processed. This improves throughput and lowers cost per token—critical metrics for inference systems serving millions of users. However, if we push precision too low, we risk numerical instability and degradation in accuracy. Subtle rounding errors or a limited dynamic range can distort the learned representations, resulting in degraded text quality, logical errors, or even complete model collapse.

In essence, precision is the balancing act between performance and reliability. It is one of the fundamental levers for optimizing LLM deployment.

3) What are the most common precision modes?

Before we dive into which parts of the model depend on precision, we need to understand the different precision options available. When we talk about “precision” for LLMs, we’re really choosing a numeric language for the model’s tensors. Different formats trade accuracy for speed and memory in different ways.

  • fp32 (single precision): The traditional 32-bit float is numerically stable and forgiving. It’s also heavy: twice the memory of 16-bit formats and rarely economical for large-scale inference. You’ll still see it in training or for sensitive accumulation paths, but not as your serving default.
  • fp16 (half precision): fp16 cuts memory in half versus fp32 and is widely optimized in today’s kernels. Its exponent is narrower than bf16, which means less dynamic range; that can bite on “spiky” activations unless you mix in higher-precision accumulations. In practice, fp16 is a strong baseline for quality-first inference when your stack is tuned for it. 
  • bf16 (brain floating point): bf16 preserves fp32’s wider exponent field but trims the mantissa, so you keep dynamic range while saving memory. That’s why many teams find bf16 a bit more stable than fp16 on volatile activations, especially in training or mixed-precision inference. 
  • int8 (8-bit integer quantization): The workhorse for production inference: mature toolchains, predictable memory wins (≈½ of fp16), and—when you calibrate well—often negligible quality loss. Seminal results like LLM.int8() (https://arxiv.org/pdf/2208.07339) also explain why it works: most channels quantize cleanly, while a small set of “outlier” features should be handled in higher precision (a mixed path). A minimal per-channel int8 sketch follows this list.


  • fp8 (E4M3, E5M2): An emerging family of 8-bit floating formats standardized across vendors. In practice, fp8 is used in mixed precision—fp8 inputs with higher-precision accumulators (fp16/fp32)—to harvest big bandwidth and memory wins without destabilizing the math. 
  • int4 / 3-bit (ultra-low bit): Great for aggressive compression (≈¼ the memory of fp16), but quality becomes task- and layer-sensitive. You’ll want per-channel or per-group scaling and to “protect” fragile layers (embeddings, final projections) at higher precision. 
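
To make the per-channel/per-group idea concrete, here is a minimal (and deliberately naive) symmetric per-channel int8 weight-quantization sketch in NumPy; production toolchains such as GPTQ or AWQ do considerably more on top of this:

import numpy as np

def quantize_int8_per_channel(w):
    """Symmetric int8 quantization with one scale per output channel (row)."""
    # A single volatile channel only affects its own scale, not everyone else's.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy example: a 4096x4096 weight drops from 32 MiB (fp16) to 16 MiB (int8).
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
print(q.nbytes / 2**20, "MiB int8, max abs error:", np.abs(dequantize(q, scale) - w).max())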

4) Which parts of the LLM can be adjusted using different precision?

First and foremost, precision matters for the weights (the model’s parameters). Weights are the largest, always-present chunk of memory. Moving from a 16-bit floating format (fp16/bf16) to int8 immediately halves the weight memory; moving again to int4 halves it once more. Because weight values are relatively stable and well-behaved statistically, weight-only int8 or int4 is often the safest way to unlock big cost savings with little or no visible quality loss—especially if you keep a few “sensitive” layers (like embeddings and the final projection) in higher precision.
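
A quick back-of-the-envelope for a hypothetical 70B-parameter model shows how the weight budget shrinks with bit-width (per-channel scales and other quantization metadata add a small overhead on top):

# Weight memory only; activations, KV cache, and runtime buffers come on top.
params = 70e9
for name, bits in [("fp32", 32), ("bf16/fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:9} ~{params * bits / 8 / 2**30:5.0f} GiB of weights")
# fp32 ~261 GiB, bf16/fp16 ~130 GiB, int8 ~65 GiB, int4 ~33 GiB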

Next are the activations, the intermediate results produced as the model processes your tokens. Activations drive bandwidth during matrix multiplications, so in theory, lowering their precision can deliver large speedups. In practice, many production stacks keep activations at bf16/fp16 even when weights are in int8 or int4. That’s because activation ranges can be spiky and input-dependent; pushing them to very low bit-widths without careful calibration or quantization-aware training tends to cause accuracy regressions, especially on math, coding, and long-form reasoning. The result is a common compromise: low-bit weights for memory, higher-precision activations for stability.
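
A minimal sketch of that compromise, with illustrative shapes and naive on-the-fly dequantization (real kernels fuse these steps, but the numerics follow the same idea):

import torch

def w8a16_linear(x_bf16, q_weight_int8, scale):
    # Dequantize int8 weights to bf16, then run the matmul in bf16.
    w = q_weight_int8.to(torch.bfloat16) * scale.to(torch.bfloat16)
    return x_bf16 @ w.t()

x = torch.randn(2, 16, 1024, dtype=torch.bfloat16)              # activations stay bf16
qw = torch.randint(-127, 128, (4096, 1024), dtype=torch.int8)   # int8 weights
scale = torch.rand(4096, 1) * 0.01                              # per-channel scales
print(w8a16_linear(x, qw, scale).shape)                         # torch.Size([2, 16, 4096])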

Finally, there’s accumulation precision—the precision used inside the math itself when partial sums are built up. Even when inputs are low-bit, accumulations are kept higher to avoid catastrophic numerical error (for example, int8 × int8 → int32 accumulators, or fp8 inputs with fp16/fp32 accumulation). This mixed-precision pathway is a quiet workhorse of stable low-bit inference: it lets you reap most of the memory and bandwidth savings from compact inputs while preserving the numerical fidelity where it matters most.
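
To see why the accumulator needs headroom, consider an int8 × int8 matrix multiply: each output element sums hundreds of products that individually fit in 8 bits but collectively do not. A small NumPy illustration:

import numpy as np

a = np.random.randint(-127, 128, size=(64, 256), dtype=np.int8)
b = np.random.randint(-127, 128, size=(256, 64), dtype=np.int8)

# Each output sums 256 products of up to 127*127 ≈ 16k, so totals can reach ~4.1M:
# far beyond int8/int16, comfortably inside int32.
acc = a.astype(np.int32) @ b.astype(np.int32)   # int8 inputs, int32 accumulators
print(acc.dtype, acc.max())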

So what’s the problem? Lowering precision is not free. Push it too far or in the wrong place and you’ll see quality drift (shorter or less reliable reasoning, code failures, rare-token brittleness), latency surprises (memory savings without speedups if your kernels aren’t optimized), and operational fragility. The art of production LLMs is choosing where to spend your bits. Our engineers at DeepInfra perform this tuning for each model to deliver the best possible quality at good performance.

5) Quality vs speed: why quantization is standard for inference but hard for training

Quantization exists because large language models are bottlenecked by memory bandwidth at inference time. Every generated token requires moving huge tensors—weights, activations, and the attention KV cache—between GPU memory and compute units. 

That raises an obvious question: if low precision is so good, why not train in very low precision too? The answer is numerical stability. Training is far less forgiving than inference: gradients and optimizer states have a wide dynamic range and can be extremely noisy. Push precision too low—say, int8 or int4—and learning typically fails to converge or converges to a worse solution. This is why modern training uses mixed precision rather than uniformly low precision: forward/backward passes in bf16 or fp16 for efficiency, with crucial pieces (e.g., some accumulators, master weights, or optimizer states) kept in fp32 to preserve signal. Ultra-low-bit training remains an active research area; ultra-low-bit inference is the practical, production-ready win today.
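
A minimal mixed-precision training step in PyTorch, assuming a CUDA device and using a toy model purely for illustration: the forward and backward passes run in fp16 under autocast while the optimizer keeps fp32 master state (with bf16, the loss scaler is usually unnecessary):

import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # fp32 optimizer states
scaler = torch.cuda.amp.GradScaler()                         # loss scaling for fp16

for _ in range(3):
    x = torch.randn(8, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).float().pow(2).mean()                # low-precision forward
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()                            # scaled backward pass
    scaler.step(optimizer)                                   # fp32 parameter update
    scaler.update()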

Because quantization changes the numbers the model sees and produces, you have to measure its impact on quality—not just speed. The standard top-line metric is perplexity, which captures how well the model predicts text overall. But perplexity alone can be misleading, so pair it with task-level evaluations: QA and reasoning accuracy, code pass@k (does any of k samples solve the problem), and small, curated factuality/hallucination probes. Track latency metrics alongside these—time-to-first-token and steady-state tokens/sec—so you can articulate the true quality-vs-speed trade.
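
For reference, perplexity is just the exponential of the average per-token negative log-likelihood; a small sketch with illustrative shapes (batch, sequence length, vocabulary size):

import math
import torch
import torch.nn.functional as F

def perplexity(logits, labels):
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    return math.exp(nll.item())

logits = torch.randn(2, 128, 32000)            # (batch, seq, vocab)
labels = torch.randint(0, 32000, (2, 128))     # next-token targets
print(perplexity(logits, labels))              # on the order of the vocab size for random logits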

Not all workloads are equally tolerant of low bits. The pain points tend to cluster around code and math, where tiny rounding errors can cascade; very long chains of thought, where errors accumulate across many steps; rare tokens and niche domains, which rely on outlier-heavy channels; and safety/refusal behavior, which can drift subtly under aggressive quantization. Many teams mitigate this by “protecting the edges”: keep embeddings, the first/last transformer blocks, and the final projection at higher precision while quantizing the middle layers.

6) PTQ vs QAT

Quantization is how we turn a float-heavy LLM into something cheaper and faster to run. There are two broad ways to get there. With post-training quantization (PTQ) you take a trained model and compress it without retraining; with quantization-aware training (QAT) you fine-tune the model while simulating low-bit noise so it learns to behave well at those bit-widths. The right choice depends on your time budget, your quality bar, and how low you want to push the bits.

PTQ is the fastest on-ramp. You stream a small calibration set—typically a few thousand sequences that look like your real traffic—through the model to estimate scales and (if needed) absorb outliers, then quantize the weights and, optionally, the activations and KV cache. In practice, most teams begin with weight-only int8 because it yields immediate memory savings and usually preserves quality. Methods in the AWQ/GPTQ/SpQR family protect salient or outlier-heavy channels, and “smooth/absorb” style approaches stabilize activation ranges so weight and activation quantization behave better. A crucial detail is granularity: using per-channel or per-group scaling (e.g., groups of 32/64/128) keeps one volatile channel from ruining the approximation for the rest. PTQ’s appeal is that it’s quick, reversible, and infrastructure-light—you can A/B several schemes and ship the best one without spinning up a training pipeline.
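
A stripped-down calibration sketch along these lines, using a toy linear layer and random batches as stand-ins for real traffic: stream data through, track per-channel absolute maxima with a forward hook, and turn them into int8 scales:

import torch

layer = torch.nn.Linear(512, 512)        # stand-in for a layer you want to calibrate
running_absmax = torch.zeros(512)

def hook(_module, _inputs, output):
    global running_absmax
    running_absmax = torch.maximum(running_absmax, output.detach().abs().amax(dim=0))

handle = layer.register_forward_hook(hook)
for _ in range(100):                     # stand-in for a few thousand real sequences
    layer(torch.randn(32, 512))
handle.remove()

scales = running_absmax / 127.0          # per-channel scales for int8 activations
print(scales.shape, scales.mean())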

PTQ isn’t a silver bullet. It can stumble on code and math, on very long chains of thought, and on rare-token or niche domains, especially if you try to quantize activations as well as weights or push down to int4. Layer sensitivity matters too: embeddings, the first and last transformer blocks, and the final projection are often more fragile than the middle of the network. A common production pattern is to quantize the middle and keep those edge layers at higher precision, combine int8 weights with bf16/fp16 activations for stability, and compress the KV cache to int8 to unlock longer contexts without out-of-memory surprises.

QAT takes the slower, sturdier route. During fine-tuning, you insert “fake quant” operations so forward and backward passes simulate low-bit arithmetic while the optimizer still runs in a safe precision. The model adapts to quantization noise, which is why int4 targets—and weight+activation quantization—often require QAT to hit acceptable quality. Accumulations stay higher precision (e.g., int8×int8→int32, or fp8 accumulating into fp16/fp32), but the rest of the path experiences the same clipping, rounding, and scaling it will see at inference time. The trade-off is that you need data, compute, and a training loop; you’ll also want the same per-channel/per-group scaling tricks and, where necessary, to keep a handful of sensitive layers unquantized.
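
Conceptually, the “fake quant” op is a quantize/dequantize round trip in the forward pass with a straight-through estimator in the backward pass; a minimal PyTorch sketch:

import torch

def fake_quant_int8(w):
    # Forward sees the rounded/clamped int8 grid; backward treats the op as identity.
    scale = w.detach().abs().amax() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127) * scale   # quantize, then dequantize
    return w + (q - w).detach()                                  # straight-through estimator

w = torch.randn(256, 256, requires_grad=True)
loss = fake_quant_int8(w).pow(2).mean()
loss.backward()
print(w.grad.abs().mean())   # gradients flow despite the rounding in the forward pass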

In practice, many providers use a hybrid strategy. They ship PTQ weight-only int8 quickly to cover most models and workloads, then apply targeted QAT to the models or layers where PTQ regresses—typically the ones serving coding, math, or long-reasoning use cases, or the ones where activation quantization would yield a big throughput win. This keeps time-to-market low while reserving training effort for the places it matters most.

7) KV-cache quantization & long context

Every autoregressive step reuses past attention Keys and Values instead of recomputing them—that’s the KV cache. At short prompts, its footprint is modest, but as sequence length grows, it can dwarf everything else. A simple back-of-the-envelope explains why: KV memory scales with batch × sequence length × layers × (2 × hidden size) times the bytes per element. That linear dependence on sequence length means long-context serving becomes memory-bound unless you compress the cache. For a friendly walkthrough of what KV caching is and why it speeds up decoding, see Hugging Face’s explainer and their hands-on “KV cache from scratch” post: https://huggingface.co/blog/not-lain/kv-caching 
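
Plugging illustrative numbers into that formula makes the growth tangible. The config below is a hypothetical 7B-class model without grouped-query attention (which would shrink the cache further):

# KV bytes = batch * seq_len * layers * 2 (K and V) * hidden_size * bytes_per_element
def kv_cache_gib(batch, seq_len, layers, hidden, bytes_per_elem):
    return batch * seq_len * layers * 2 * hidden * bytes_per_elem / 2**30

for name, nbytes in [("fp16", 2), ("int8", 1)]:
    gib = kv_cache_gib(batch=8, seq_len=8192, layers=32, hidden=4096, bytes_per_elem=nbytes)
    print(name, round(gib, 1), "GiB")
# fp16: 32.0 GiB vs int8: 16.0 GiB for this batch, comparable to the ~13 GiB
# of fp16 weights for a 7B model.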

Quantizing the KV cache is the most direct way to unlock longer contexts and larger batches on the same GPU. In practice, INT8 KV is a solid default: it halves KV bytes versus fp16/bf16 with little task-level impact for many workloads, and it’s supported in mainstream stacks like vLLM and TensorRT-LLM. If you need to push further, INT4 KV can work—but it’s task-sensitive: reasoning and code are more likely to show quality drift, so you should validate carefully on your own prompts before rolling it out broadly. 

Compression alone isn’t enough in very long contexts; you also need better memory management. Paged Attention—the technique behind vLLM—stores KV in fixed-size pages and uses indirection tables to avoid fragmentation and to share/reuse cache efficiently across requests. The result is near-zero KV waste and higher throughput at the same latency, especially as contexts grow. Pairing KV quantization with paged attention (plus sensible eviction policies) is the current best practice for stable, low-latency long-context serving. 

At long sequence lengths, KV dominates your memory budget. Start with INT8 KV to expand effective context and batch within the same HBM; if you explore INT4 KV, do so behind targeted evals (reasoning/coding first). Combine KV quantization with Paged Attention and smart eviction to keep latency flat as contexts grow.

8) What are the effects of the precision on the model quality?

Changing precision changes the numbers your model sees and produces, so it inevitably changes behavior. Most of the time, the impact is small and positive from a systems point of view (lower latency, longer context) and negligible from a user point of view. The exceptions—and they matter—depend on where you lower precision and how far you push it. That said, reducing precision rarely causes the dramatic quality drops many people expect: the effects are real but usually small, and in most applications they are not significant.

Precision isn’t a niche implementation detail—it’s the dial that lets you trade memory and bandwidth for speed and cost without (usually) sacrificing what users care about. You’ve seen how the bits work (sign/exponent/mantissa), where they matter most in a transformer (weights, activations, KV cache), and why mixed-precision math keeps things stable. In practice, a safe, high-impact baseline is: int8 weights, bf16/fp16 activations, int8 KV, and higher-precision accumulation—then push to fp8 or int4 only where your evals say it’s worth it. For methodology, start with PTQ for quick wins and coverage; reach for QAT when you want int4 or activation quantization without quality loss. 
