Oobabooga WebUI, koboldcpp, and in fact any other software made for easy, private local LLM text generation and chatting with AI models have similar best-case scenarios when it comes to the top consumer GPUs you can use with them to maximize performance. Here is my benchmark-backed list of 7 graphics cards I found to be the best for working with various open-source large language models locally on your PC. Read on!
And here you can find a similar list, but for AMD graphics cards – 6 Best AMD Cards For Local AI & LLMs This Year
This web portal is reader-supported, and is a part of the Amazon Services LLC Associates Program and the eBay Partner Network. When you buy using links on our site, we may earn an affiliate commission!
What Are The GPU Requirements For Local AI Text Generation?
Running open-source large language models locally is not only possible, but extremely simple. If you’ve come across my guides on the topic, you already know that you can run them on GPUs with less than 8GB VRAM, or even without having a GPU in your system at all! But running the models isn’t quite enough. In an ideal world, you want to get responses as fast as possible. For that, you need a GPU that is up to the task.
So, what should you be looking for in a graphics card that will be used for AI text generation with LLMs? One of the most important answers to this question is a high amount of VRAM.
VRAM is the memory located directly on your GPU which is used when your graphics card processes data. When you run out of VRAM, the GPU has to “outsource” the data that doesn’t fit in its own memory to the main system RAM. And this is when trouble begins.
While your main system RAM is also fast, it offers only a fraction of the bandwidth of your GPU’s VRAM, and the time required to send data between the GPU and system RAM over the PCIe bus adds even more delay. Together, these are what cause extreme slowdowns when the VRAM on your graphics card runs out.
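To put rough numbers on this, a common back-of-the-envelope model treats text generation as memory-bound: producing each token requires reading roughly all of the model’s weights once, so speed is capped by how fast those weights can be read. The sketch below assumes that simplification, plus that any weights which don’t fit in VRAM are read at system-RAM speed. The bandwidth figures (about 1008 GB/s for an RTX 4090 and 89.6 GB/s for dual-channel DDR5-5600) are approximate spec-sheet values used purely for illustration, and the 20 GB model is hypothetical.

```python
def max_tokens_per_second(model_gb: float, vram_gb: float,
                          vram_bw_gbs: float, ram_bw_gbs: float) -> float:
    """Rough upper bound on decoding speed for a memory-bound LLM.

    Assumes every generated token reads all weights once, and that any
    weights that do not fit in VRAM are read at system-RAM speed instead.
    Ignores compute, KV cache and transfer latency, so real numbers are lower.
    """
    in_vram = min(model_gb, vram_gb)   # portion of the weights held on the GPU
    spilled = model_gb - in_vram       # portion pushed out to system RAM
    seconds_per_token = in_vram / vram_bw_gbs + spilled / ram_bw_gbs
    return 1.0 / seconds_per_token


# Approximate spec-sheet bandwidths (assumptions for illustration):
GPU_BW = 1008.0   # GB/s, RTX 4090 GDDR6X
RAM_BW = 89.6     # GB/s, dual-channel DDR5-5600

fits = max_tokens_per_second(20.0, 24.0, GPU_BW, RAM_BW)    # 20 GB model, 24 GB card
spills = max_tokens_per_second(20.0, 16.0, GPU_BW, RAM_BW)  # same model, 16 GB card
print(f"20 GB model, fully in VRAM: ~{fits:.0f} t/s ceiling")
print(f"20 GB model, 4 GB spilled:  ~{spills:.0f} t/s ceiling")
```

Under these assumptions, a 20 GB model that fits entirely in 24GB of VRAM has a ceiling of roughly 50 tokens per second, while spilling just 4 GB of it into system RAM drops the ceiling to roughly 17. Real-world speeds will be lower in both cases, since compute and transfer overhead are ignored here.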
Running out of VRAM is not only a problem that you might encounter when using LLMs, but also when generating images with Stable Diffusion, doing AI vocal covers for popular songs (see my guide for that here), and many other activities involving locally hosted artificial intelligence models.
There are also many other variables that count here. The number of tensor cores, the amount and speed of cache memory, and the memory bandwidth of your GPU are also crucial. However, you can rest assured that all of the GPUs listed below meet the conditions that make them top-notch choices for use with various AI models. If you want to learn even more about the technicalities involved, check out this neat explainer article here!
How Much VRAM Do You Really Need?
The straightforward answer is: as much as you can get. The fact, however, is that when it comes to consumer-grade graphics cards, there aren’t many cards with more than 24GB of VRAM on board for now. If you want the absolute best, you should aim for a 24GB card. A high-end example of such a card is the NVIDIA GeForce RTX 4090, which I’ll cover in a short while.
The only other viable way to get more usable VRAM is to connect multiple GPUs to your system, which requires both some technical skills and the right base hardware. In general, though, 24GB of VRAM on a single GPU will be able to handle most of the larger models you throw at it and is more than enough for most applications!
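Software that supports multiple GPUs usually spreads the model’s layers across the cards, so their VRAM effectively adds up. As a hypothetical illustration (the `split_layers` helper is made up for this example and is not the API of any particular tool), here is one simple policy that assigns layers in proportion to each card’s VRAM:

```python
def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign transformer layers to GPUs in proportion to their VRAM.

    Hypothetical helper for illustration; uses largest-remainder rounding
    so the per-GPU counts always add up to n_layers.
    """
    total = sum(vram_gb)
    exact = [n_layers * v / total for v in vram_gb]   # ideal fractional share per GPU
    counts = [int(x) for x in exact]                  # round every share down first
    # Hand out the leftover layers to the GPUs with the largest fractional parts.
    leftover = n_layers - sum(counts)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Example: an 80-layer model across a 24 GB card and a 12 GB card.
print(split_layers(80, [24.0, 12.0]))
```

For an 80-layer model on a 24GB card plus a 12GB card, this gives a 53/27 split. In practice, each card also needs room for the context cache and runtime buffers, so a purely proportional split is only a starting point.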
Is 8GB Of VRAM Enough For Playing Around With LLMs?
Yes, you can run some smaller LLMs even on an 8GB VRAM system. As a matter of fact, I did exactly that in my guide on running LLMs for local AI assistant roleplay chats, reaching speeds of up to around 20 tokens per second with a small context window on my old, trusty NVIDIA GeForce RTX 2070 SUPER (a short 2-3 sentence message generated in just a few seconds). You can find the full guide here: How To Set Up The OobaBooga TextGen WebUI – Full Tutorial
While you certainly can run some smaller and lower-quality LLMs even on an 8GB graphics card, if you want higher output quality and reasonable generation speeds with larger context windows, you should really only consider cards having between 12 and 24GB of VRAM – and these are exactly the cards I’m about to list out for you!
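Part of the reason larger context windows need more VRAM is the key/value (KV) cache, which stores a key and a value vector for every token in the context, in every layer. A minimal sketch of the usual estimate, assuming an fp16 cache (2 bytes per value) and a Llama-2-7B-style shape of 32 layers, 32 KV heads and a head dimension of 128:

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Estimate KV-cache size in GiB.

    The leading 2 accounts for storing both a key and a value vector
    per layer, per KV head, per token. bytes_per_value=2 assumes fp16.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1024**3


# Llama-2-7B-style shape (assumed): 32 layers, 32 KV heads, head_dim 128.
for ctx in (2048, 4096, 8192):
    print(ctx, "tokens ->", kv_cache_gib(ctx, 32, 32, 128), "GiB")
```

Under those assumptions the cache costs 0.5 MiB per token, i.e. 1 GiB at 2048 tokens, 2 GiB at 4096 and 4 GiB at 8192, on top of the model weights. Models that use grouped-query attention have fewer KV heads and need proportionally less, and some tools can also quantize the cache.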
Should You Consider Cards From AMD?
This might be a tricky question for some. While AMD cards are certainly cheaper than NVIDIA’s (in most cases, anyway), they are also known for certain driver and software support issues that you might want to avoid, especially when dabbling in locally hosted AI models. The biggest one is the lack of CUDA support, which can make AMD cards slower in certain AI-related applications.
This situation, however, is changing quickly. As time goes by, more and more open-source software for hosting large language models locally, such as Ollama, LM Studio and OobaBooga WebUI, supports AMD GPUs, and their AMD-compatible versions can work with Radeon graphics cards without much trouble. Still, if you’re afraid of spending a lot of time troubleshooting your new setup, or you’re not yet certain what kind of software you’ll be using, it’s best to stick with NVIDIA.
If you want to check out a list of AMD GPUs I recommend for local LLM software, you’re in luck! I’ve just put one together for you here: 6 Best AMD Cards For Local AI & LLMs In Recent Months
Can You Run LLMs Locally On Just Your CPU?
Yes! And one of the easiest ways to do that is to use the free open-source GPT4ALL software which you can use for generating text using AI without even having a GPU installed in your system.
Of course, keep in mind that for now, CPU inference with larger, higher quality LLMs can be much slower than if you were to use your graphics card for the process. But yes, you can easily get into simpler local LLMs, even if you don’t have a powerful GPU.
Now let’s move on to the actual list of the graphics cards that have proven to be the absolute best when it comes to local AI LLM-based text generation. Here we go!
1. NVIDIA GeForce RTX 4090 24GB
For now, the NVIDIA GeForce RTX 4090 is the fastest consumer-grade GPU your money can get you. While it’s certainly not cheap, if you really want top-notch hardware for playing around with AI, this is it.
With its 24GB of VRAM, this card is without question the absolute best choice for local LLM inference and LoRA training if you have the money to spare. It can offer amazing generation speeds of around 30-50 t/s (tokens per second) with the right configuration. This guy over on Reddit even chained 4 of these together for his ultimate rig for handling even the most demanding LLMs. Check the current prices of this beautiful beast here!
With the clear and rather unsurprising winner out of the way, let’s move on to some more affordable options, shall we?
2. NVIDIA GeForce RTX 4080 16GB (+RTX 4080 Super)
The NVIDIA GeForce RTX 4080 comes right after the 4090 when it comes to performance. Where it lacks, however, is the VRAM department.
While there is a pretty notable performance gap between the 4080 and the 4090, the most important difference between these two cards is that the GeForce RTX 4080 maxes out at 16GB of GDDR6X VRAM, significantly less than the 24GB the 4090 has to offer.
As we’ve already established, for running large language models locally you ideally want as much VRAM as you can possibly get. Because of that alone, the RTX 4080 would not be my first choice when picking a graphics card for that very purpose with a decent budget. Still, the 4080 offers great, way-above-average performance and can yield surprisingly good text generation speeds. It just won’t fit some larger LLMs which you could run without trouble on its bigger brother.
The NVIDIA RTX 4080 Super, which was released quite recently, performs very close to the regular 4080 in benchmarks. As it has the same 16GB of video memory on board, it’s hard for me to recommend it over the base model for any purpose, including gaming. The jump in performance simply isn’t enough, and there are much better options out there.
3. NVIDIA GeForce RTX 4070 Ti Super 16GB – A New Contender
The NVIDIA GeForce RTX 4070 Ti Super, in addition to having a long name, has also surpassed the RTX 4070 Ti in terms of performance since its market release, even if just by a little.
This card, however, isn’t just a slightly better-performing 4070 Ti. It also got a significant VRAM upgrade, which makes it essentially a faster 4070 Ti with 16GB of GDDR6X video memory and a 256-bit memory bus (compared to the 192-bit bus on all of the other 4070s).
This places the 4070 Ti Super one spot higher on this list than the RTX 4070 Ti, which, by the way, is still a very reliable card if you can accept its lower VRAM capacity and bandwidth. Still, in my opinion this card is really worth it if you’re upgrading from the previous generation, or just starting out without any previous NVIDIA GPU. If you’re interested in the slightly cheaper base 4070 Ti, read on!
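The 256-bit vs 192-bit memory bus difference translates directly into memory bandwidth: peak bandwidth is the bus width times the per-pin data rate, divided by 8 to convert bits to bytes. A quick sketch, assuming a 21 Gbps GDDR6X data rate for both the 4070 Ti and the 4070 Ti Super:

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Each pin of the bus moves data_rate_gbps gigabits per second;
    dividing by 8 converts bits to bytes.
    """
    return bus_width_bits * data_rate_gbps / 8


ti = peak_bandwidth_gbs(192, 21.0)        # RTX 4070 Ti (21 Gbps assumed)
ti_super = peak_bandwidth_gbs(256, 21.0)  # RTX 4070 Ti Super (21 Gbps assumed)
print(ti, ti_super, f"{ti_super / ti - 1:.0%} more")
```

That works out to 504 GB/s for the 4070 Ti and 672 GB/s for the 4070 Ti Super, a third more bandwidth. This matters for LLMs because token generation is largely limited by how fast the model’s weights can be read from VRAM.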
4. NVIDIA GeForce RTX 4070 Ti 12GB (+RTX 4070 Super)
The NVIDIA GeForce RTX 4070 Ti, while having even less VRAM than the 4080, is a bit more affordable than my first two picks, and it’s still one of the best-performing GPUs on the market right now.
This card, which still makes the overall list of top GPUs you can get this year, offers about twice the performance of the RTX 3060, and it does so for a pretty good price. If you can make do with 12GB of VRAM, this might just be a good choice for you.
To be perfectly honest, this card is in a bit of a weird place when it comes to LLM use. It doesn’t give you a large amount of VRAM, it visibly falls behind the 4080 and the 4090 in both benchmarks and real-life performance, and sadly its price doesn’t seem to reflect that yet. If you’re looking for a better price/performance ratio, consider checking out the 3xxx series cards that I’m about to show you.
And then come the two newer cards, the NVIDIA RTX 4070 Super and the previously mentioned NVIDIA RTX 4070 Ti Super. Yes, it was confusing for me too at first. The situation is basically this: the RTX 4070 Super, alongside the base 4070, falls behind the 4070 Ti in both real-life and benchmark performance, while the 4070 Ti Super is a direct upgrade of the Ti version with more CUDA cores, more VRAM and a wider memory bus, as mentioned before. Both the 4070 Ti and the Ti Super are great choices here.
Want the absolute best graphics cards available this year? – I’ve got you covered! – Best GPUs To Upgrade To These Days (My Honest Take!)
5. NVIDIA GeForce RTX 3090 Ti 24GB – Most Cost-Effective Option
With the NVIDIA GeForce RTX 3090 Ti, we’re stepping down in price even more, but surprisingly without sacrificing much performance. The 3090, alongside the 3080 series, is still among the most commonly chosen GPUs for LLM use.
In my personal experience, confirmed by recorded user benchmarks, the 3090 Ti comes right after the already mentioned 4070 Ti in terms of performance. When it comes to price, this top-end GPU from the NVIDIA 3xxx series is probably one of the best pieces of hardware on this list. The newest 4xxx generation of NVIDIA cards is still pretty overpriced, but the prices of the older models have been slowly dropping since the end of the previous year.
Learn more about this GPU here: NVIDIA GeForce 3090/Ti For AI Software – Is It Still Worth It?
So in other words, both the original 3090 (offering just a tad less performance and the same amount of video memory) and the 3090 Ti are the most cost-effective graphics cards on this list. If you absolutely don’t want to overpay, you can also get one of these second-hand. You can find quite a few 3090s on eBay for a very good price!
6. NVIDIA GeForce RTX 3080 Ti 12GB
After the 3090 Ti, quite naturally, comes its lower-tier sibling, the NVIDIA GeForce RTX 3080 Ti. This GPU, while having only 12GB of VRAM on board, is still a pretty good choice if you’re able to find a good deal on it.
The 3080 Ti and the 3090 Ti are really close together in their specs and real-world performance. When it comes to on-board VRAM, however, the 3090 Ti is easily the better choice. Since the 3090 Ti’s small performance advantage comes with a TDP about 100 watts higher, the real difference between the two is VRAM, and in my eyes the 3080 Ti is only worth it if you can find it used for cheap.
If you can, grab the 3090 Ti or a base 3090 instead. If the price isn’t substantially better, there is no good reason to settle for the 3080 Ti, mainly because of the smaller amount of VRAM it has to offer. Now let’s move on to the real budget king you might have been waiting for!
7. NVIDIA GeForce RTX 3060 12GB – The Best Budget Choice
The NVIDIA GeForce RTX 3060 with 12 GB of VRAM on board and a pretty low current market price is in my book the absolute best tight budget choice for local AI enthusiasts both when it comes to LLMs, and image generation.
I can already hear you asking: why is that? Well, the prices of the RTX 3060 have already fallen quite substantially, while its performance, as you might have guessed, did not. In most benchmarks this card is placed right after the RTX 3060 Ti and the 3070, and you will be able to run most 7B or 13B models with moderate quantization on it at decent text generation speeds. With the right model and the right configuration, you can get almost instant generations in low to medium context window scenarios!
As always, you can also look at used GPU deals on eBay for previous-gen graphics cards like this one! Finding the right one can make your purchase even more budget-friendly!
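To sanity-check the claim that a 12GB card like the RTX 3060 handles most 7B and 13B models with moderate quantization, you can estimate weight memory as parameter count times bits per weight, divided by 8. The sketch below adds a flat 1.5 GB for the context cache and runtime buffers; that overhead is an assumed round number rather than a measurement, and real quantized model files typically use a bit more than their nominal bit width per weight.

```python
def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Return True if the weights plus an assumed fixed overhead fit in VRAM."""
    # billions of parameters * bits per weight / 8 bits per byte = decimal GB
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb


for params in (7, 13):
    for bits in (16, 8, 4):
        ok = fits_in_vram(params, bits, vram_gb=12.0)
        print(f"{params}B @ {bits}-bit: {'fits' if ok else 'too big'} in 12 GB")
```

Under these assumptions, a 7B model fits at 8-bit or 4-bit, while a 13B model only fits at around 4-bit, which is why quantization matters so much on 12GB cards.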