Best GPUs For Local LLMs In 2024 (My Top Picks!)

Oobabooga WebUI, koboldcpp, and virtually any other software made for easy, private local LLM text generation and AI chatting share similar best-case scenarios when it comes to the top consumer GPUs you can pair them with to maximize performance. Here is my benchmark-backed list of 6 graphics cards I found to be the best for working with various open-source large language models locally on your PC. Read on!

Want the absolute best graphics cards available this year? – I’ve got you covered! – Best GPUs To Upgrade To These Days (My Honest Take!)

Note: The cards on the list are ordered by their price. Read the descriptions for info regarding their performance!

This web portal is reader-supported, and is a part of the Amazon Services LLC Associates Program and the eBay Partner Network. When you buy using links on our site, we may earn an affiliate commission!

What Are The GPU Requirements For Local AI Text Generation?

A basic chat conversation with an AI model using the OobaBooga text generation WebUI.
Contrary to popular belief, for basic AI text generation with a small context window you don’t really need to have the absolute latest hardware – check out my tutorial here!

Running open-source large language models locally is not only possible, but extremely simple. If you’ve come across my guides on the topic, you already know that you can run them on GPUs with less than 8GB of VRAM, or even without having a GPU in your system at all! But merely running the models isn’t quite enough. In an ideal world you want to get responses as fast as possible, and for that, you need a GPU that is up to the task.

So, what should you be looking for in a graphics card that is to be used for AI text generation with LLMs? One of the most important answers to this question is – a high amount of VRAM.

VRAM is the memory located directly on your GPU which is used when your graphics card processes data. When you run out of VRAM, the GPU has to “outsource” the data that doesn’t fit in its own memory to the main system RAM. And this is when trouble begins.

While your main system RAM is still reasonably fast, its bandwidth is far lower than that of your GPU’s VRAM, and every piece of data that spills over has to travel back and forth over the PCIe bus. That constant shuffling between the GPU and system RAM is what causes the extreme slowdowns once the VRAM on your graphics card runs out.
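
If you want to see for yourself how close you are to spilling over into system RAM, you can check your card’s total and free VRAM before loading a model. Here is a minimal sketch using PyTorch (assuming you have a CUDA-enabled build installed) – it’s just a quick diagnostic, not part of any particular WebUI:

```python
# Minimal sketch (assumes PyTorch with CUDA) for checking how much VRAM
# your card has and how much of it is currently free before loading a model.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    print(f"GPU: {props.name}")
    print(f"Total VRAM: {total_bytes / 1024**3:.1f} GB")
    print(f"Free VRAM:  {free_bytes / 1024**3:.1f} GB")
else:
    print("No CUDA-capable GPU detected - inference would fall back to CPU and system RAM.")
```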

Running out of VRAM is not only a problem that you might encounter when using LLMs, but also when generating images with Stable Diffusion, doing AI vocal covers for popular songs (see my guide for that here), and many other activities involving locally hosted artificial intelligence models.

There are also many other variables that count here. The number of tensor cores, the amount and speed of cache memory, and the memory bandwidth of your GPU are also crucial. However, you can rest assured that all of the GPUs listed below meet the conditions that make them top-notch choices for use with various AI models. If you want to learn even more about the technicalities involved, check out this neat explainer article here!

How Much VRAM Do You Really Need?

NVIDIA RTX 2070 SUPER with the OobaBooga WebUI.
Here are my generation speeds on my old NVIDIA RTX 2070 SUPER, reaching up to 20 tokens/s using the OobaBooga text generation WebUI.

The straightforward answer is: as much as you can get. The reality, however, is that when it comes to consumer-grade graphics cards, there aren’t really any with more than 24GB of VRAM on board for now. If you want the absolute best, these are the ones you should aim for. A prime example of such a high-end card would be the NVIDIA GeForce RTX 4090, which I’ll cover in a short while.

The only other viable way to get more operational VRAM is to connect multiple GPUs to your system, which requires both some technical skill and the right base hardware. In general though, 24GB of VRAM on a GPU will handle most of the larger models you throw at it and is more than enough for most applications!
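
To get an intuition for why 24GB goes such a long way, you can do some quick back-of-the-envelope math: a model’s weights alone take up roughly its parameter count times the number of bytes per parameter, and quantization shrinks that dramatically. Here’s a small Python sketch of that rule of thumb (real memory use will be somewhat higher because of the context window and other overhead):

```python
# Rough rule-of-thumb sketch: weight memory = parameter count x bytes per parameter.
# Context/KV-cache and activations add extra on top, so treat these as lower bounds.

def weight_vram_gb(params_billions: float, bits_per_param: float) -> float:
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total / 1024**3

for params in (7, 13, 30):
    for bits in (16, 8, 4):
        print(f"{params}B model @ {bits}-bit: ~{weight_vram_gb(params, bits):.1f} GB")
```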

Is 8GB Of VRAM Enough For Playing Around With LLMs?

Yes, you can run some smaller LLM models even on an 8GB VRAM system. As a matter of fact, I did exactly that in my guide on running LLM models for local AI assistant roleplay chats, reaching speeds of up to around 20 tokens per second with a small context window on my old trusty NVIDIA GeForce RTX 2070 SUPER (a short 2-3 sentence message generated in just a few seconds). You can find the full guide here: How To Set Up The OobaBooga TextGen WebUI – Full Tutorial

While you certainly can run some smaller and lower-quality LLMs even on an 8GB graphics card, if you want higher output quality and reasonable generation speeds with larger context windows, you should really only consider cards having between 12 and 24GB of VRAM – and these are exactly the cards I’m about to list out for you!
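
If you want to experiment on an 8GB card outside of a full WebUI, partial GPU offloading is the usual trick: keep as many model layers as fit in VRAM on the GPU and leave the rest in system RAM. Below is a minimal, hedged sketch using llama-cpp-python – the model filename and layer count are placeholders; adjust them to whatever quantized GGUF model you actually download:

```python
# Hedged sketch: fitting a quantized 7B model onto an ~8GB card by offloading
# only part of the layers to the GPU and keeping the rest in system RAM.
# Assumes llama-cpp-python is installed with CUDA support.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-7b-model.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=32,   # lower this if you still run out of VRAM
    n_ctx=2048,        # a small context window keeps memory use down
)

output = llm("Write a two-sentence greeting for a chat assistant.", max_tokens=64)
print(output["choices"][0]["text"])
```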

Should You Consider Cards From AMD?

Our NVIDIA RTX 3070 Ti testing unit.
In most cases, especially if you’re a beginner when it comes to local AI and deep learning, it’s best to pick a graphics card from NVIDIA rather than AMD. Here is why.

This might be a tricky question for some. While AMD cards are certainly cheaper than the ones sold by NVIDIA (in most cases anyway), they are also known for certain driver and software support issues that you might want to avoid, especially when dabbling in locally hosted AI models. On top of that, AMD cards lack CUDA support, which makes them substantially slower in many AI-related applications. They are simply not great for an ideal out-of-the-box experience with what we’re doing here, at least in my honest opinion.

Moreover, you should also know that many pieces of software such as the Automatic1111 WebUI, or the OobaBooga WebUI for text generation (and more), have different installation and configuration paths for AMD GPUs, and their support for graphics cards other than NVIDIA’s is oftentimes rather poor. If you’re afraid of spending a lot of time troubleshooting your new setup, it’s best to stick with NVIDIA – trust me on this one.

Can You Run LLMs Locally On Just Your CPU?

Intel Core i7-13700KF processor installed on a motherboard, closeup shot.
GPT4All lets you run many open-source LLM models on your CPU. In that case, the models are loaded directly into main system RAM. In most cases, CPU inference is noticeably slower than running the same model on a GPU.

Yes! And one of the easiest ways to do that is to use the free, open-source GPT4All software, which lets you generate text with AI without even having a GPU installed in your system.

Of course, keep in mind that for now, CPU inference with larger, higher quality LLMs can be much slower than if you were to use your graphics card for the process. But yes, you can easily get into simpler local LLMs, even if you don’t have a powerful GPU.
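
If you’d rather drive GPT4All from a script instead of its desktop app, the project also ships Python bindings. Here’s a minimal CPU-only sketch – the model name is just an example of a small model from the GPT4All catalog (it gets downloaded on first run), so swap in whichever compatible model you prefer:

```python
# Minimal CPU-only sketch using the GPT4All Python bindings.
from gpt4all import GPT4All

# Example small model from the GPT4All catalog; downloaded automatically on first run.
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

with model.chat_session():
    reply = model.generate("Explain in one sentence what VRAM is.", max_tokens=100)
    print(reply)
```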

Now let’s move on to the actual list of the graphics cards that have proven to be the absolute best when it comes to local AI LLM-based text generation. Here we go!

1. NVIDIA GeForce RTX 4090 24GB

NVIDIA GeForce RTX 4090 24GB graphics card.

For now, the NVIDIA GeForce RTX 4090 is the fastest consumer-grade GPU your money can get you. While it’s certainly not cheap, if you really want top-notch hardware for messing around with AI, this is it.

With 24GB of VRAM on board, this card is without question the absolute best choice for local LLM inference and LoRA training if you have the money to spare. It can offer amazing generation speeds of up to around ~30-50 t/s (tokens per second) with the right configuration. This guy over on Reddit even chained 4 of these together for his ultimate rig for handling even the most demanding LLMs. Check the current prices of this beautiful beast here!

With the clear and rather unsurprising winner out of the way, let’s move on to some more affordable options, shall we?

2. NVIDIA GeForce RTX 4080 16GB

NVIDIA GeForce RTX 4080 16GB graphics card.

The NVIDIA GeForce RTX 4080 comes right after the 4090 when it comes to performance. Where it lacks, however, is in the VRAM department.

While there is a pretty notable performance gap between the 4080 and the 4090, the most important difference between these two cards is that the GeForce RTX 4080 maxes out at 16GB of GDDR6X VRAM, which is significantly less than the 4090 has to offer.

As we’ve already established, for running large language models locally you ideally want as much VRAM as you can possibly get. For that reason alone, the RTX 4080 would not be my first choice if you have the budget for something better. Still, the 4080 offers way-above-average performance and can yield surprisingly good results when it comes to text generation speed. It just won’t fit some of the larger LLM models which you could run without trouble on its bigger brother.

3. NVIDIA GeForce RTX 4070 Ti 12GB

NVIDIA GeForce RTX 4070 Ti 12GB graphics card.

The NVIDIA GeForce RTX 4070 Ti, while having even less VRAM than the 4080, is just a little bit more affordable than my first two picks, and it’s still one of the best-performing GPUs on the market as of now.

This card, still making the overall top list of GPUs you can get this year, offers about two times the performance of the RTX 3060, and it does so for a pretty good price. If you can make do with 12GB of VRAM, this might just be a good choice for you.

Being perfectly honest, this card is in a bit of a weird place when it comes to LLM use. It doesn’t give you a particularly large amount of VRAM, it visibly falls behind the 4080 and the 4090 in both benchmark and real-life performance, and sadly its price doesn’t seem to reflect that yet. If you’re looking for a better price/performance ratio, consider checking out the 3xxx series cards that I’m about to show you.

4. NVIDIA GeForce RTX 3090 Ti 24GB – Most Cost-Effective Option

NVIDIA GeForce RTX 3090 Ti 24GB graphics card.

With the NVIDIA GeForce RTX 3090 Ti, we’re stepping down in price even further, but surprisingly, without sacrificing much performance. The 3090, alongside the 3080 series, is still among the most commonly chosen GPUs for LLM use.

In my personal experience, confirmed by recorded user benchmarks, the 3090 Ti comes right after the already mentioned 4070 Ti performance-wise. When it comes to price, this last GPU of the NVIDIA 3xxx series is probably one of the best deals on this list. The newest 4xxx generation of NVIDIA cards is still pretty overpriced, while the older models have been slowly dropping in price since the end of last year.

So in other words, both the original 3090 (offering just a tad less performance and the same amount of video memory) and the 3090 Ti are the most cost-effective graphics cards on this list. If you absolutely don’t want to overpay, you can also get one of these second-hand. You can find quite a few 3090s on eBay for a very good price!

5. NVIDIA GeForce RTX 3080 Ti 12GB

NVIDIA GeForce RTX 3080 Ti 12GB graphics card.

After the 3090 Ti quite naturally comes its slightly lower-tier sibling, the NVIDIA GeForce RTX 3080 Ti. This GPU, while having only 12GB of VRAM on board, is still a pretty good choice if you’re able to find a good deal on it.

The 3080 Ti and the 3090 Ti are really close together when it comes to specs and real-world performance. When it comes to on-board VRAM, however, the 3090 Ti easily comes off as the better choice. Given how small the performance gap is (the 3090 Ti does draw around 100 watts more), the 3080 Ti is in my eyes only worth it if you can find it used for cheap.

If you can, grab the 3090 Ti, or a base 3090 instead. Unless the price is substantially better, there is no good reason to stick with the lower-tier model, mainly because of the smaller amount of VRAM it has to offer. Now let’s move on to the real budget king which you might have been waiting for!

6. NVIDIA GeForce RTX 3060 12GB – The Best Budget Choice

NVIDIA GeForce RTX 3060 12GB graphics card.

The NVIDIA GeForce RTX 3060, with 12 GB of VRAM on board and a pretty low current market price, is in my book the absolute best tight-budget choice for local AI enthusiasts, both when it comes to LLMs and image generation.

I can already hear you asking: why is that? Well, the price of the RTX 3060 has already fallen quite substantially, while its performance, as you might have guessed, has not. In most benchmarks this card is placed right after the RTX 3060 Ti and the 3070, and you will be able to run most 7B or 13B models with moderate quantization on it at decent text generation speeds. With the right model and the right configuration, you can get almost instant generations in low-to-medium context window scenarios!

As always, you can also look at some used GPU deals on eBay when it comes to previous-gen graphics cards like this one! Finding the right one can make your purchase even more budget-friendly!

You might also like: Best GPUs For AI Training & Inference This Year – My Top List
