For local AI workflows such as Stable Diffusion image generation or LLM inference (and, for that matter, modern games that need plenty of VRAM to run well), having enough video memory on your GPU to load your models is one of the most important things to get right. Here are the best ways I found to manage local AI models back when I was still using an 8GB VRAM GPU.
You might also like: 12 Best High VRAM GPU Options This Year (Consumer & Enterprise)
What Actually Takes Up Lots of VRAM on Your System
Your operating system’s window manager, background applications, and browser tabs can all claim sizable portions of your GPU’s dedicated memory. Many of them do so even when simply left open in the background.
On a fresh Windows 11 boot with a 2K main display and two additional screens, you might already be losing 400-600MB of VRAM before you launch a single application. That might not sound like much, but it matters a lot on low-VRAM systems with only 8-12GB of memory to spare.
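If you’re curious what your own idle baseline looks like and you’re on an NVIDIA card, a few lines of Python using the official NVML bindings (pip install nvidia-ml-py) will tell you. This is just a quick sketch; run it right after boot, before opening anything else:

```python
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first NVIDIA GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM in use: {mem.used / 1024**2:.0f} MiB "
      f"out of {mem.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()
```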
While tricks such as dropping your Windows desktop resolution or turning off system-wide hardware acceleration are legitimate ways to free up some memory, they can only do so much. Below, I’ll show you a few useful tricks I relied on back when I was training SDXL LoRA models on my old 8GB VRAM GPU.
1. Image/Video Editing Software (Adobe Photoshop, DaVinci Resolve, etc.)
Applications like Adobe Photoshop, Premiere Pro, and DaVinci Resolve are well known for claiming large amounts of video memory that often isn’t released until you exit the software completely. Even when idle with a simple project open, they reserve huge chunks of video memory for GPU-accelerated filters, canvas rendering, and caching.
A simple 1080p project in DaVinci Resolve can easily reserve 2GB to 4GB of VRAM, leaving little room for your AI models. The very same goes for Adobe Photoshop with a few simple images open.
If you’re doing any kind of local AI work on your system, it’s best to fully close these applications before you even begin. In the image above, you can see Adobe Photoshop with a few high-resolution photos open taking up over 1GB of video memory in the Windows Process Explorer interface.
For a detailed guide on monitoring and tracking your VRAM usage on your system, you can check out my short guide here: How To Check Per-Process VRAM Usage On Windows 11
2. Web Browsers With Hardware Acceleration Enabled
Most, if not all, modern web browsers, including Chrome, Firefox, and Microsoft Edge, use hardware acceleration by default to render web pages and decode video more smoothly using your graphics card.
While this can make high-resolution video playback faster, it does use up some of your GPU’s video memory, and the usage grows with every video- or VRAM-heavy web app tab you have open. A browser with 10+ active tabs, especially if any contain YouTube videos or graphics-heavy WebGL apps, can easily steal a large chunk of your available VRAM.
Luckily, you can prevent your browser from using your GPU altogether, either temporarily or permanently. Go to your browser settings, search for the graphics/hardware acceleration option, and toggle it off. Then restart your browser to make sure the change takes effect.
This forces the CPU to handle in-browser content rendering, freeing up memory on your GPU, which can in turn be used by whatever VRAM-intensive workflow you’re running. The more tabs with graphics-heavy content you had open, the larger your memory savings will be.
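If you’d rather not flip the setting permanently, Chromium-based browsers also accept the standard "--disable-gpu" switch, which turns acceleration off for a single session only. Here’s a minimal sketch; the executable path below is just a typical Windows install location, so adjust it to your system:

```python
import subprocess

# Typical Windows install path for Chrome; adjust for your browser/system.
CHROME = r"C:\Program Files\Google\Chrome\Application\chrome.exe"

# --disable-gpu disables GPU compositing/acceleration for this session only.
subprocess.Popen([CHROME, "--disable-gpu"])
```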
3. Screen Recording & Streaming Software (OBS, Streamlabs, etc.)
If you stream your AI workflow or record tutorials, software like OBS Studio will typically default to a GPU-based encoder (such as NVENC on NVIDIA cards) to process video data for better performance. This keeps the load off your CPU, but it requires dedicated VRAM overhead to function. Recording a 4K canvas at 60fps can easily consume hundreds of megabytes of video memory in the background without you noticing.
If you are approaching your VRAM limit and absolutely must record or stream, you can try switching your encoder from your main GPU to a CPU-based one (x264). However, a word of warning: CPU encoding is extremely resource-heavy, so you’ll likely need to drop your recording resolution to 1080p or your system can easily start struggling. Alternatively, if your CPU has integrated graphics (Intel Quick Sync or an AMD iGPU), select that as your encoder instead for better performance.
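If you’d rather script a quick capture than reconfigure OBS, here’s a hedged sketch of the same idea using FFmpeg’s Windows desktop capture with CPU-side x264 encoding. This is not OBS’s own configuration, just an equivalent CPU-encoding setup, and the output filename is an example:

```python
import subprocess

# Capture the Windows desktop and encode on the CPU with x264,
# leaving the GPU's encoder (and its VRAM overhead) untouched.
subprocess.run([
    "ffmpeg",
    "-f", "gdigrab",         # Windows desktop capture input
    "-framerate", "30",
    "-i", "desktop",
    "-vf", "scale=1920:-2",  # downscale to 1080p to keep the CPU load sane
    "-c:v", "libx264",       # CPU encoder
    "-preset", "veryfast",   # faster preset = less CPU strain
    "-crf", "23",
    "recording.mp4",
])
```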
4. Desktop Window Manager & High Display Resolutions
If you’re a Windows user, the Desktop Window Manager (dwm.exe) is the process responsible for rendering your desktop interface. Its VRAM usage scales directly with your monitor resolution, HDR settings, and the number of displays connected to your PC. Running a dual-monitor setup with a 4K primary display and HDR enabled requires a much larger framebuffer than a single 1080p screen, and every additional monitor or bump in resolution and refresh rate pushes the requirement higher still.
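To get a feel for the numbers, here’s a rough back-of-envelope calculation. The real dwm.exe footprint is higher, since it also composites individual window surfaces, and buffer counts vary, but it shows why resolution and HDR matter:

```python
def framebuffer_mib(width: int, height: int, bytes_per_pixel: int,
                    buffers: int = 3) -> float:
    """Approximate size of a (typically triple-buffered) swap chain in MiB."""
    return width * height * bytes_per_pixel * buffers / 1024**2

# 1080p SDR display, 4 bytes per pixel (8-bit RGBA):
print(framebuffer_mib(1920, 1080, 4))  # ~23.7 MiB
# 4K HDR display, 8 bytes per pixel (FP16 scRGB):
print(framebuffer_mib(3840, 2160, 8))  # ~189.8 MiB
```

Stack a couple of high-resolution HDR displays on top of each other, and the ~200-400MB figure below stops looking surprising.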
If your display configuration is on the more elaborate side and you could really use an extra ~200-400MB of VRAM to fit a model (depending on your setup), you can always try lowering your desktop resolution, turning off HDR in Windows Display Settings, or disconnecting secondary monitors.
Note: This really is an inconvenient brute-force method, and on most systems it won’t make a noticeable difference. Still, if you’re rocking three high-resolution displays on an 8GB VRAM GPU just like I used to do, this might be a useful tip for you.
5. Electron/CEF-Based Apps (Discord, Spotify, VS Code)
Many modern desktop apps, such as Discord, Spotify, Slack, or VS Code to name a few, are built on either Electron or CEF (the Chromium Embedded Framework). To simplify things: each one of them can be treated as a dedicated web browser instance running in the background of your system. Many of these apps, just like the web browsers we’ve already talked about, default to GPU hardware acceleration, in turn taking up portions of video memory on your graphics card.
Having Discord, Spotify, and Steam running simultaneously in the background can quietly reserve large chunks of your available VRAM. If you don’t believe me and you have Discord and Steam open, for instance, open your Task Manager and keep an eye on your video memory usage (total or per-process) before and after you close these applications.
Of course, just as with web browsers, you can also disable hardware acceleration in these apps, sacrificing some performance for more available memory. Discord and Steam expose this option in their settings menus, and since these apps embed Chromium, the same "--disable-gpu" launch flag from the browser example above usually works for them too.
You can learn much more about LLM quantization levels and VRAM usage here: LLMs & Their Size In VRAM Explained – Quantizations, Context, KV-Cache
6. Use Inference Software & Models Optimized for Low-VRAM Environments
Besides reducing the VRAM used by software unrelated to your local AI workflows, you can also optimize the video memory usage of the workflows themselves. For local Stable Diffusion image generation, switching from the standard Automatic1111 WebUI to Stable Diffusion WebUI Forge can reduce VRAM requirements by as much as 1-1.5GB with a notable speed increase, simply because of the memory-management fixes introduced by the developers of this fork.
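Forge applies this kind of optimization for you, but if you run generation from Python instead, the Hugging Face diffusers library exposes similar memory-saving switches. A minimal sketch, assuming diffusers and accelerate are installed; the model ID and prompt are just examples:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Example checkpoint; swap in whichever SDXL model you actually use.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,  # half precision halves the weight footprint
)

# Stream submodules between RAM and VRAM instead of keeping it all resident.
pipe.enable_model_cpu_offload()
# Trade a little speed for a much smaller attention memory peak.
pipe.enable_attention_slicing()
# Decode the final image in tiles to avoid a large end-of-pipeline VRAM spike.
pipe.enable_vae_tiling()

image = pipe("a watercolor fox, highly detailed",
             num_inference_steps=30).images[0]
image.save("fox.png")
```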
For LLMs, sticking to GGUF quantized models with compressed bit-depths (e.g., Q4, Q5) is almost always a good idea when you’re short on video memory. These cut the model’s memory footprint to a fraction of the full-precision version (roughly a third of the FP16 size for a Q4 quant), allowing you to run much larger, more capable models on cards that would otherwise have to depend on slow RAM offloading.
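As a concrete example, here’s how loading a Q4 GGUF model looks with the llama-cpp-python bindings; the file path is a placeholder for whichever quantized model you’ve downloaded:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # example path
    n_gpu_layers=-1,  # -1 offloads every layer that fits onto the GPU
    n_ctx=4096,       # context length costs VRAM too (KV cache), keep it modest
)

out = llm("Q: Why does quantization save VRAM? A:", max_tokens=64)
print(out["choices"][0]["text"])
```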
When it comes to model training and fine-tuning workflows, especially if you’re a beginner, I really do recommend reading up on the software you’re using for these purposes. As an example, training Stable Diffusion LoRA models with lower batch sizes, more memory-efficient optimizers, and adjusted settings can significantly reduce memory overhead, making it possible to train SDXL LoRAs even on systems with just 8GB of video memory.
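To illustrate the kind of levers involved, here’s a generic sketch of the usual memory savers in a diffusers-style training setup. This is not Kohya’s actual configuration, just the same concepts expressed in code:

```python
import bitsandbytes as bnb
from diffusers import UNet2DConditionModel

# Example checkpoint; any SD/SDXL UNet exposes the same switches.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)

# 1. Recompute activations during the backward pass instead of caching them.
unet.enable_gradient_checkpointing()

# 2. 8-bit Adam keeps optimizer state in int8, cutting its VRAM cost ~4x.
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=1e-4)

# 3. Batch size 1 plus gradient accumulation approximates a larger batch
#    without the larger memory peak.
train_batch_size, grad_accum_steps = 1, 4
```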
If you’re interested in what settings in the Kohya GUI can help you reduce your VRAM usage when training LoRAs, you definitely need to check out my quick guide on the training settings here: Kohya LoRA Training Settings Explained.
Monitor Your GPU VRAM Usage Effectively

On Windows, the Task Manager GPU tab in the “Performance” section will by default only give you a total dedicated/shared GPU memory usage graph. That’s often just enough to see whether your training/inference model data is getting offloaded to your main system RAM (which slows down your workflows by a huge factor), but not which program is responsible.
To pinpoint the software using up large chunks of your VRAM, you need to see per-process memory usage. On Windows, it’s a matter of enabling the “Dedicated GPU Memory” column in the “Details” tab of Task Manager, or using the Windows Process Explorer. For Linux users, nvitop, gpustat, and amdgpu_top (for AMD cards) are all viable options.
For a quick tutorial on how to find the processes that are hogging your VRAM in the background, you can check out my guide on how to monitor per-process VRAM usage on Windows 11.
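If you’d rather script this check on an NVIDIA card, the same NVML bindings from the earlier sketch can list per-process usage too. Note that regular desktop apps show up as “graphics” processes, while PyTorch and similar show up as “compute” ones:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# Graphics processes (DWM, browsers, games) plus compute ones (PyTorch, etc.)
procs = (pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle)
         + pynvml.nvmlDeviceGetComputeRunningProcesses(handle))

for p in sorted(procs, key=lambda p: p.usedGpuMemory or 0, reverse=True):
    name = pynvml.nvmlSystemGetProcessName(p.pid)
    mib = (p.usedGpuMemory or 0) / 1024**2  # None when the driver won't report it
    print(f"{mib:8.0f} MiB  pid={p.pid}  {name}")

pynvml.nvmlShutdown()
```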
“System Memory Fallback” on NVIDIA Systems
One more important thing: modern NVIDIA drivers (536.40 and newer) include a “System Memory Fallback” feature. Depending on your settings, when your VRAM fills up, your GPU will either let the application crash with an out-of-memory (OOM) error, or start offloading data to your significantly slower system RAM.
While this behavior actively prevents your software from crashing, it can absolutely tank its performance, oftentimes resulting in much longer inference times both for diffusion models and local LLMs. You can manage this behavior in the NVIDIA Control Panel (the “CUDA - Sysmem Fallback Policy” setting under Manage 3D Settings) and set it to “Prefer No Sysmem Fallback” if you want to strictly stay within VRAM limits to maintain speed. Note that if you disable the fallback and exceed your VRAM, your application will crash instantly (OOM) rather than slowing down, which is good for debugging but bad for unsaved work.
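If your workflow runs through PyTorch, you can check your headroom and see which of the two behaviors your driver is currently set to. A quick hedged sketch (assumes a CUDA build of PyTorch; deliberately over-allocating like this is purely a test):

```python
import torch

free, total = torch.cuda.mem_get_info()  # bytes, for the current device
print(f"Free VRAM: {free / 1024**3:.2f} GiB of {total / 1024**3:.2f} GiB")

try:
    # Deliberately over-allocate to see how the driver reacts.
    blob = torch.empty(int(free * 1.5), dtype=torch.uint8, device="cuda")
except torch.cuda.OutOfMemoryError:
    print("Hard OOM raised: the driver is not falling back to system RAM.")
else:
    print("Allocation succeeded: data most likely spilled into slow system RAM.")
```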
Either way, you should always keep your loaded models within the constraints of your GPU VRAM for the best performance regardless of the workflow.
GPU Upgrade Options With 16GB+ Video Memory

If you have optimized everything and still hit OOM errors, or you see your models getting offloaded to system RAM, you are simply hitting the physical limits of your hardware. There is only so much optimization can do once you decide you want higher generated image resolutions, better local LLMs, larger context windows, and so on.
For most people looking for a reasonably priced GPU upgrade, 16GB is the new safe baseline for comfortable SDXL generation and running mid-sized LLMs locally. Cards like the NVIDIA RTX 5080 or the RTX 4060 Ti (16GB version), as well as AMD offerings like the Radeon RX 7900 XT 20GB, are currently the most accessible entry points for higher-VRAM workflows.
For more serious enthusiasts, the much older but still very reliable RTX 3090/3090 Ti (24GB), the RTX 4090 (24GB), the RTX 5090 (32GB), and the AMD Radeon RX 7900 XTX (24GB) are among the only 24GB+ consumer cards currently available on the market.
If you’d like to go a step further and get into workstation/enterprise GPUs, there are many options you might not even have heard of up to this point. You can learn much more about that here: 12 Best High VRAM GPU Options This Year (Consumer & Enterprise)
Here is my budget GPU list that I update frequently as new options hit the market; you can use it to get the gist of current GPU prices: Top 7 Best Budget GPUs for AI & LLM Workflows