LLMs & Their Size In VRAM Explained – Quantizations, Context, KV-Cache

To run an open-source LLM locally on your GPU efficiently, you need to fit all of the data it works on during inference into your graphics card’s video memory (VRAM). The model’s final size in VRAM mainly depends on three things: its size in parameters (8B, 12B, 30B), its context window settings, and the runtime KV-cache values. The rule of thumb is: take the model size and add about 20% overhead. That, however, is just half the story.

Quick Glossary – Key Terms Explained

  • Weights / Parameters – Fixed learned values (numbers) defining the model’s behavior and knowledge. Loaded into VRAM (or RAM) during inference.
  • Quantization – Compressing the 32/16-bit weights to 8/6/5/4 bits; smaller file & VRAM footprint at the cost of a slight accuracy drop.
  • Context window – Maximum number of tokens the model can “see” in one go (prompt + generated message history).
  • Inference – The process of running a trained model to compute outputs from given input without changing its parameters.
  • KV-cache – Key/Value tensors cached per token and per layer to avoid recomputing attention over previous tokens during generation in Transformer decoders.
  • Activations – Temporary intermediate outputs produced as the model processes inputs. They represent the internal state of the computation and are stored in memory for the duration of inference. Cleared after output is produced.
  • RAM Offloading – For GPU inference: moving model parts (e.g., weights, KV-cache) from GPU VRAM to system RAM when VRAM is insufficient, with large speed tradeoffs.
  • GPU Offloading – For CPU-based inference: moving compute-heavy tasks (e.g., KV-cache or matrix multiplication) to the GPU to accelerate performance if VRAM allows.

Why VRAM Matters for Local LLMs

GPU VRAM usage shown in the Windows task manager after loading an example LLM in LM Studio.

When running large language models (LLMs) on your own hardware, GPU VRAM (video memory) will in most cases be the limiting factor. All of the model’s weights must reside in VRAM during inference, alongside a few other things like the KV Cache we’re going to talk about in just a bit. The larger the model (the more parameters/weights it has), the more VRAM it needs.

For example, a 7 billion parameter model (7B) in full 16-bit precision would require roughly 14 GB of GPU memory just for its weights. By contrast, a huge 65B model would need over 130 GB in 16-bit – far beyond any single consumer GPU’s VRAM capacity.

Still, keep in mind what Maximilian Schwarzmüller notes in his superb write-up on LLMs (here discussing a 4-bit quantized 7B model as an example):

“A common misconception might be that if a model is, say, 7 billion parameters, it “only” needs 7GB of VRAM. However, besides the parameters themselves, VRAM also needs to hold the input data (prompt, context), intermediate calculations (activations), and the output.”

https://maximilian-schwarzmueller.com/articles/llms-gpu-cpu-vram-ram/

Running out of VRAM on model load and/or later on during inference will force your GPU to offload some of the model’s data to system RAM (or in extreme cases, to your system drive).

Once the model data is split between your GPU VRAM and main system RAM, the calculations on the model’s weights that happen during inference become much less efficient, because of the way the offloaded data has to be handled by your LLM inference software.

This can dramatically slow down text generation, either because of the latency introduced by moving the model’s tensors from RAM to VRAM layer by layer, or because the software environment switches fully to CPU-only inference. The exact behavior, once again, will depend on the software you decide to use.

In short, more VRAM = ability to run bigger models with faster responses and larger context windows. But even if you don’t have a top-end GPU, careful model selection and optimization (like selecting an appropriate model quantization for your needs) can let you run surprisingly capable LLMs on mid-range hardware.

Let’s now quickly break down how model size and memory precision affect VRAM requirements.

Model Size vs. Memory Requirement

4-bit quantizations of a few different 8B models shown in the LM Studio interface.

An LLM’s size in VRAM is primarily determined by its model weights/parameters – and the precision each weight is stored in. At FP32 precision, each parameter uses 4 bytes. At FP16/BF16 precision, each uses 2 bytes. In practical terms, that means:

  • A 7B model in FP16 takes ~14 GB (7 billion × 2 bytes) of VRAM for weights alone.
  • A 13B model in FP16 is ~26 GB (13 billion × 2 bytes) – more than a typical single 24 GB GPU can hold.
  • Larger models (30B, 70B, etc.) would scale up further (30B FP16 ≈ 60 GB; 70B FP16 ≈ 140 GB), which is way beyond single-GPU VRAM limits without getting into memory optimization techniques like quantization.
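To make these numbers easy to reproduce, here is a minimal back-of-the-envelope sketch in plain Python. It simply applies the parameters × bytes-per-weight math from above, plus the ~20% runtime overhead rule of thumb discussed in the next paragraph; note that it uses the decimal convention (1 GB = 10⁹ bytes), so real loaders will report slightly different numbers.

```python
# Rough VRAM estimate: parameters × bytes per weight, plus ~20% runtime overhead.
# This is only the rule-of-thumb math from this article, not an exact measurement.

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory taken by the weights alone, in (decimal) gigabytes."""
    total_bytes = params_billions * 1e9 * (bits_per_weight / 8)
    return total_bytes / 1e9

def rough_vram_gb(params_billions: float, bits_per_weight: int,
                  overhead: float = 0.20) -> float:
    """Weights plus the ~20% rule-of-thumb overhead for runtime buffers."""
    return weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead)

for size_b in (7, 13, 30, 70):
    print(f"{size_b}B @ FP16: weights ≈ {weight_memory_gb(size_b, 16):.0f} GB, "
          f"with overhead ≈ {rough_vram_gb(size_b, 16):.0f} GB")
```

Running this reproduces the 14 / 26 / 60 / 140 GB weight figures listed above; the same functions also work for quantized bit widths (8, 4), which we’ll get to in a moment.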

These figures, however, are just for the static model weights. Inference also uses some extra memory for activations and attention history. As noted by EleutherAI, the rough rule of thumb is to add ~20% overhead for runtime memory when planning out your VRAM usage. As they put it:

“In addition to the memory needed to store the model weights, there is also a small amount of additional overhead during the actual forward pass. In our experience this overhead is ≤ 20% and is typically irrelevant to determining the largest model that will fit on your GPU.”

https://blog.eleuther.ai/transformer-math/#memory-requirements

For example, a 13B model at FP16 with a 4096-token context needs about 34 GB in total (weights + overhead), even though its weights alone account for only about 26 GB of its size in memory. The exact overhead can change depending on a few factors.
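As a quick sanity check on where a number like that comes from (my own rough decomposition, not an exact formula): 26 GB of FP16 weights, plus the ~20% overhead, plus a few gigabytes of KV cache at 4096 tokens lands in the mid-30 GB range.

```python
# Rough decomposition of the ~34 GB estimate for a 13B FP16 model at 4096 tokens.
weights_gb  = 13 * 2             # 13B parameters × 2 bytes (FP16) ≈ 26 GB
overhead_gb = weights_gb * 0.20  # ~20% rule-of-thumb runtime overhead ≈ 5.2 GB
kv_cache_gb = 3.3                # ~4096-token KV cache for a 13B model (see the context section below)
print(weights_gb + overhead_gb + kv_cache_gb)  # ≈ 34.5 GB
```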

If you’re interested in the concrete math going on under the hood when your model performs inference calculations, this is a resource you’ll be very happy to read through: Transformer Inference Arithmetic – kipply’s blog

Quantization: Fitting Models into Less VRAM

Quantization is the key to running large models on GPUs with limited memory. Quantization simply means using lower-bit representations for model weights instead of 16-bit or 32-bit floats. Since we’ve already touched upon this topic above, let’s take a moment to explain it a bit further.

You can picture it in the following way: each of the model’s weights contains a piece of valuable information which the model will utilize during inference. These weights are simple floating-point values which can be represented with various levels of accuracy (32, 16, 8, or 4 bits per value, and so on).

Common formats are 8-bit (INT8) and 4-bit (INT4), which use 1 byte or 0.5 bytes per parameter, respectively. By compressing the model, quantization can cut VRAM requirements dramatically – often with minimal impact on model quality.

Modern quantization methods can maintain surprising accuracy. For example, even GPTQ 4-bit quantizations are reported to show relatively small losses in output quality compared to the original full FP16 models. There are many papers where you can read much more about various LLM quantization strategies.

Many users find that 8-bit and 4-bit models perform nearly as well as 16-bit ones when it comes to inference. The huge benefit here is memory savings: switching from an FP16 (16-bit) model to one with 8-bit weight representations halves the VRAM needs, and going down to 4-bit effectively cuts it to one-quarter.

In practice, this means that a 7B model that required ~14 GB in FP16 can shrink to about 7 GB at 8-bit or 3.5 GB at 4-bit with proper quantization. It works in the exact same way for 11B and 13B models, and so on. These trimmed-down versions let you load bigger models on smaller cards.

Quantization formats: You might encounter terms like GGUF or GPTQ. GGUF is a new unified file format in the llama.cpp ecosystem that supports multiple quantization levels and enables splitting model layers between CPU and GPU for efficient inference. GPTQ, on the other hand, is a popular GPU-focused quantization method that compresses model weights to reduce memory usage and speed up inference, as we’ve just explained above.

The good news is you usually don’t need to quantize a model yourself – many popular models are readily available in 8-bit or 4-bit forms (e.g. on Hugging Face), under appropriate labels. By choosing a quantized variant, you can run a much larger model on your GPU than you otherwise could.
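If you ever do want to quantize on the fly rather than grab a pre-quantized file, the transformers + bitsandbytes stack can compress the weights to 4-bit at load time. Here is a minimal sketch, assuming an NVIDIA GPU with the bitsandbytes package installed; the model ID is only an illustrative example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative example, any causal LM repo works

# Ask transformers/bitsandbytes to quantize the FP16 weights to 4-bit while loading,
# so a ~14 GB model fits into roughly 4 GB of VRAM (plus runtime overhead).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized layers on the GPU automatically
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Pre-quantized GGUF or GPTQ downloads skip this step entirely; the compression has already been done for you.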

Matching Model Size to Your Needs

Bigger isn’t always better for every use case – you should choose a model size that fits both your VRAM budget and your application. Here are some general guidelines on model sizes and typical use cases:

  • 3B–7B models (small): Useful for basic chat, straightforward Q&A, and lightweight text generation. These can handle casual conversations or simple summarization, but may struggle with complex instructions or precise coding tasks. On the plus side, they run fast even on modest GPUs and can fit in very low VRAM (even on a 4GB card, with 4-bit quantization in low-context scenarios).
  • 13B–15B models (medium): These mid-size models often show a noticeable jump in ability over 7B ones. Models like LLaMA 2 13B or Vicuna-13B can follow more complex instructions, produce more coherent longer responses, and do better at coding or reasoning than 7B variants. They require more VRAM (typically 12+ GB if quantized, or 24+ GB at full precision), but are a sweet spot for many local use cases like basic chatbots, AI character roleplay and simplified code assistants, balancing quality and resource usage.
  • 30B–40B models (large): High-capacity models that further improve on factual accuracy, coding, and reasoning. A model of this class (like the original LLaMA 30B or Falcon-40B) can produce answers closer to GPT-3.5/4o in terms of quality, especially when fine-tuned for chat or code. However, they demand significant VRAM – usually at least 15–20 GB with 4-bit quantization.
  • 65B–70B models (very large): These are among the most capable open-source models available (LLaMA 3 70B, etc.). They excel at complex tasks, detailed reasoning, and maintaining conversation context, often outperforming smaller models by a good margin. The trade-off is they are extremely large and nearly impossible to use efficiently on a single consumer GPU. For instance, 70B in 4-bit still needs around 35–40 GB of VRAM in practice. Running these locally might require splitting across two 24 GB GPUs or getting into cloud GPU setups. If you only have a single card with 16–24 GB of VRAM, you’ll likely want to stick to ≤30B models, or use CPU offloading for 70B with a significant performance hit (more on that later).

Models smaller than 3B can in many cases be used for simple language processing and editing tasks, as well as message processing or entity extraction. They can, however, rarely be used in more complex, quality-focused text generation workflows.

In summary, try to choose the smallest model that meets your task needs. If you’re just planning to chat with a local AI assistant or summarize short texts, a 7B or 13B model will most likely suffice. For more complex coding help or more accuracy, 13B+ is recommended, with ~30B being even better if you have the VRAM to spare. For the maximum attainable quality (approaching GPT-4-like performance on some comparative benchmarks), you’d look at 65B+ models, provided you have enough memory to run them locally.
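If you want to turn these guidelines into a quick first-pass check, here is a toy helper built purely on the earlier rule of thumb (weights at a given bit width plus ~20% overhead). It knows nothing about a specific model’s architecture or your context length, so treat its answer as a rough filter rather than a guarantee:

```python
def fits_in_vram(params_billions: float, bits_per_weight: int,
                 vram_gb: float, overhead: float = 0.20) -> bool:
    """Rule-of-thumb check: weights + ~20% overhead vs. available VRAM."""
    needed_gb = params_billions * (bits_per_weight / 8) * (1 + overhead)
    return needed_gb <= vram_gb

# Which common model sizes fit on a 16 GB card at 4-bit?
for size_b in (7, 13, 30, 70):
    print(f"{size_b}B @ 4-bit fits in 16 GB:", fits_in_vram(size_b, 4, 16))
```

Remember to leave extra headroom for the KV cache on top of this, especially if you plan on using long context windows – which is exactly what the next section is about.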

The Impact of Context Length on VRAM

KV cache grows linearly with the model’s context length (VRAM usage in GB vs. context window size in tokens). | Chart sourced from: Context Kills VRAM: How to Run LLMs on consumer GPUs – Lyx

So far we assumed a standard context length (prompt + generated tokens) of a few thousand tokens. Context length – how much text the model can consider at once – also affects memory usage. A longer context window means the model must store key/value tensors for more past tokens (the “KV cache”) in memory during inference. This KV cache grows linearly with context length and model size. In practical terms, doubling the context (e.g. from 2048 tokens to 4096 tokens) can add several gigabytes to VRAM usage for a large model.

For example, running a 13B model with a 2048-token context is estimated to use about 1.6 GB just for the attention cache. At 4096 tokens, that roughly doubles (~3.3 GB). As models with 8K or 16K context windows become available, the VRAM overhead for context can become significant.
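For reference, here is roughly where numbers like that come from. In a standard Transformer decoder the KV cache stores two tensors (keys and values) per layer for every token, so its size is approximately 2 × layers × KV heads × head dimension × bytes per value × tokens. A minimal sketch using LLaMA-13B-like dimensions (40 layers, 40 KV heads, head size 128 are assumed values here; models that use grouped-query attention have far fewer KV heads and therefore a much smaller cache):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size: 2 tensors (K and V) per layer, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1e9

# LLaMA-13B-like dimensions (assumed): 40 layers, 40 KV heads, head_dim 128, FP16 cache
print(kv_cache_gb(40, 40, 128, 2048))  # ≈ 1.7 GB
print(kv_cache_gb(40, 40, 128, 4096))  # ≈ 3.4 GB
```

Some runtimes can also quantize the KV cache itself (e.g. to 8-bit), which shrinks these numbers further.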

As Lyx very accurately stated in his great article about the relation between VRAM use and model context size:

“You can run big models or long prompts on a consumer GPU (8–16 GB VRAM) — but rarely both.”

https://medium.com/@lyx_62906/context-kills-vram-how-to-run-llms-on-consumer-gpus-a785e8035632

What this means for you: If your GPU is tight on memory, you may need to limit the max context length or be cautious with very long prompts. Many 7B/13B models default to 2048 tokens, which is fine on most setups. If you use an extended context model (say LLaMA-2 7B with 4K or 8K context), expect higher memory usage.

In low-VRAM scenarios, you can also configure some software (like for instance LM Studio) to offload the KV cache to CPU RAM, trading speed for lower GPU memory usage. Keeping generation length (the number of tokens you ask the model to output) reasonable will also avoid blowing past your VRAM limits due to an accumulating cache. This has always been a problem with local AI character chats which extend over a dozen long messages on lower-VRAM systems.

In summary, longer context = more VRAM. Choose a context length that fits your GPU’s capacity, and remember that advertised maximums (like “4096 tokens”) are upper bounds; you don’t always have to use the full window if memory is a concern.

To predict how much VRAM your chosen model/context-length combination will take up, including the other important factors, you can use one of the many existing LLM memory calculators, which are listed down below.

RAM Offloading: When VRAM Is Limited

What if your available VRAM just isn’t enough for the model you want to run? In these cases, you have options to offload some or all of the model data to your main system RAM. Many inference frameworks like llama.cpp, and by extension local LLM software like LM Studio or KoboldCpp, support this, allowing you to run models much larger than your available GPU VRAM, albeit with much slower performance.

Hugging Face Transformers (through the Accelerate library) can automatically distribute a model across GPU and CPU. By setting the parameter device_map="auto" when loading the model, it will try to fill your GPU memory first, then put overflow layers into CPU memory, and as a very last resort use disk swap. This means that even if a model technically needs 20 GB and you only have a 16 GB GPU, the remaining 4 GB can sit in system RAM. The trade-off is speed: any time data is pulled from CPU to GPU, generation will slow down due to the transfer latency. Still, this feature lets you experiment with larger models than your GPU could normally handle – for example, people have successfully loaded 70B models on 16–24 GB cards by offloading a portion to RAM.
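A minimal sketch of what that looks like in code; the model ID and the memory caps are purely illustrative, so adjust them to your own hardware:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"  # illustrative example

# device_map="auto" lets Accelerate fill the GPU first and spill the remaining
# layers into CPU RAM (and only as a last resort onto disk).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "15GiB", "cpu": "48GiB"},  # optional per-device caps
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```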

Llama.cpp (a popular lightweight LLM runtime) offers a similar capability. It uses GGML/GGUF model files and allows specifying how many layers to keep on GPU vs CPU. If you don’t have enough VRAM for all layers, you can offload some to the CPU and still run the model. This is useful for long contexts or huge models on limited hardware. Do note that pure CPU inference is much slower – often orders of magnitude slower than using a GPU. So, while offloading enables functionality, it’s best to keep the majority of the model within the constraints of your video memory, if you can afford to do that.
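If you go through llama.cpp via the llama-cpp-python bindings, the GPU/CPU split is controlled with the n_gpu_layers parameter; GUI front-ends like LM Studio and KoboldCpp expose an equivalent “GPU layers” setting. A short sketch, with the GGUF file path as a placeholder:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with GPU support)

llm = Llama(
    model_path="./models/some-13b-model-Q4_K_M.gguf",  # placeholder path to a GGUF file
    n_gpu_layers=30,  # number of layers kept in VRAM; -1 offloads as many as possible
    n_ctx=4096,       # context window size; remember this grows the KV cache
)

out = llm("Q: Why does a longer context need more VRAM? A:", max_tokens=128)
print(out["choices"][0]["text"])
```

Depending on your llama-cpp-python version there is also an offload_kqv flag for keeping the KV cache off the GPU, which is the same idea as the LM Studio option mentioned earlier.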

Manage the offloaded data carefully: if you do rely on CPU offloading, make sure you have a sufficient amount of system RAM, and be mindful that too much swapping can lead to stuttering outputs or timeouts. It’s often worth trying a smaller quantized model before resorting to heavy offloading of a larger one – you might get similar output quality without needing as much VRAM, and without the speed penalty, which can be both annoying and confusing, especially when you’re just starting out.

In general, regardless of what inference environment you’re using, most of them will support offloading in one way or another. Use that to your advantage if you’re short on video memory!

VRAM Calculators For Local LLMs – A Short List

There are already plenty of different calculators meant to help you figure out which large language model you’ll be able to run on your GPU with satisfactory generation speeds. While most of these are fairly accurate, if you’re just starting out, our rule of thumb of taking the model size and adding ~20% overhead will most of the time be enough to determine your model’s size in VRAM.

If however, you’re interested in tools like these which can be very helpful for more complex local LLM workflows, here are the best ones I came across during my research:

  1. APXML VRAM Calculator – great looking interface, quick & easy to use.
  2. Asmirnov VRAM Estimator – a much simpler GUI, still pretty useful.
  3. HF-Accelerate Model Memory Usage Tool – a very simple tool in which you can directly drop links to models from HuggingFace repositories.
  4. AI Multiple Research LLM VRAM Calculator for Self-Hosting – another useful calculator served together with a neat explainer on LLM GPU requirements.
  5. NyxKrage LLM VRAM Calculator – officially endorsed by the SillyTavern community.

With the right combination of model size, quantization, and context length, plus careful use of VRAM use estimators, these days almost anyone can run surprisingly powerful LLMs locally — even on mid-range hardware!
