mantus.ai

FIVE MINUTES TODAY, A SMARTER TOMORROW

How do you get GPU acceleration working?

Install NVIDIA drivers and CUDA toolkit for GPU-accelerated AI training. Verify your setup works with PyTorch or TensorFlow and troubleshoot common issues.

A GPU can speed up AI training dramatically, often by an order of magnitude or more over a CPU. Most deep learning frameworks rely on CUDA, NVIDIA's parallel computing platform, for this. Getting it working on Ubuntu means installing the right driver and toolkit, then verifying that everything connects properly.

Install NVIDIA drivers first

Ubuntu includes open source GPU drivers that work for basic display tasks. For AI work, you need NVIDIA's proprietary drivers that expose the full GPU capabilities.

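Check what GPU you have. The pattern below also matches "3D controller", which is how some NVIDIA GPUs (for example in hybrid-graphics laptops) are listed:

lspci | grep -iE "vga|3d controller"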

This shows your graphics card model. Most modern NVIDIA cards work well for AI. Even older GTX 1060s can train small models effectively.
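The same check can be scripted. The following is a minimal Python sketch, not a definitive tool: find_nvidia_gpus and GPU_CLASSES are hypothetical names introduced here, and the list of PCI class names is an assumption about how lspci labels graphics devices.

```python
import shutil
import subprocess

# PCI class names under which lspci may list a GPU (assumed list, not exhaustive).
GPU_CLASSES = ("VGA compatible controller", "3D controller", "Display controller")


def find_nvidia_gpus(lspci_output: str) -> list[str]:
    """Return the lspci lines that describe an NVIDIA graphics device."""
    return [
        line.strip()
        for line in lspci_output.splitlines()
        if "nvidia" in line.lower() and any(c in line for c in GPU_CLASSES)
    ]


def main() -> None:
    if shutil.which("lspci") is None:
        print("lspci not found")
        return
    output = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    gpus = find_nvidia_gpus(output)
    for gpu in gpus:
        print(gpu)
    if not gpus:
        print("No NVIDIA GPU found")


if __name__ == "__main__":
    main()
```

Filtering on the device class keeps the NVIDIA audio controller that often accompanies a graphics card out of the result.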

Open the Additional Drivers application from your applications menu. Ubuntu detects NVIDIA hardware automatically and shows available driver versions. Pick the latest "tested" proprietary driver, not the open source nouveau option.

Apply the changes and reboot. After restart, verify the driver loaded:

nvidia-smi

This command displays your GPU information, current usage, and available memory. If you see a table with your GPU listed, the driver works.
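If you want the same information in a script, nvidia-smi can emit CSV through its --query-gpu and --format options. The sketch below is one way to read that output from Python; parse_gpu_rows, query_gpus, and the chosen field list are illustrative assumptions, not part of any NVIDIA API.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; with "nounits", memory values are in MiB.
QUERY_FIELDS = ["name", "driver_version", "memory.total", "memory.used"]


def parse_gpu_rows(csv_text: str) -> list[dict[str, str]]:
    """Turn nvidia-smi CSV output (no header) into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus() -> list[dict[str, str]]:
    """Run nvidia-smi and return its report, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    gpus = query_gpus()
    for gpu in gpus:
        print(f"{gpu['name']}: driver {gpu['driver_version']}, "
              f"{gpu['memory.used']} / {gpu['memory.total']} MiB used")
    if not gpus:
        print("No GPU reported; the driver may not be loaded")
```

An empty result covers both a missing nvidia-smi binary and a driver that failed to load, so treat it as a prompt to investigate rather than a precise diagnosis.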

Install the CUDA toolkit

CUDA provides the low-level interface that PyTorch and TensorFlow use for GPU acceleration. The version matters, because each framework release supports only certain CUDA versions.

Check what CUDA version your current PyTorch installation expects:

python3 -c "import torch; print(torch.version.cuda)"

If this prints None, the installed PyTorch is a CPU-only build. If you haven't installed PyTorch yet, check the PyTorch website for the CUDA version recommended for the release you want.

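Download CUDA from NVIDIA's website or install it from Ubuntu's repositories. The Ubuntu packages may lag behind the latest release but are often easier to maintain. The "runfile" installer from NVIDIA gives you more control but can conflict with packages installed through apt.

If you chose the runfile, make it executable and run it. The filename below is for CUDA 12.1; substitute the file you downloaded: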

chmod +x cuda_12.1.0_530.30.02_linux.run
sudo ./cuda_12.1.0_530.30.02_linux.run

During installation, deselect the driver installation option since you already installed drivers. Install only the CUDA toolkit and samples.

Add CUDA to your system path by editing your shell configuration:

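echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
# ${VAR:+:$VAR} appends the old value only when it is set, which avoids a trailing colon
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}' >> ~/.bashrc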

Restart your terminal or run source ~/.bashrc to load the new paths.
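To confirm the new paths took effect, you can inspect the environment from Python. This is a small sketch that assumes the /usr/local/cuda locations used in the export lines above; cuda_env_report is a hypothetical helper name.

```python
import os
import shutil

# Directories assumed from the export lines above.
CUDA_BIN_DIR = "/usr/local/cuda/bin"
CUDA_LIB_DIR = "/usr/local/cuda/lib64"


def cuda_env_report(env: dict[str, str]) -> dict[str, bool]:
    """Report whether the CUDA bin and lib directories appear in env."""
    path_dirs = env.get("PATH", "").split(":")
    lib_dirs = env.get("LD_LIBRARY_PATH", "").split(":")
    return {
        "bin_on_path": CUDA_BIN_DIR in path_dirs,
        "lib_on_ld_path": CUDA_LIB_DIR in lib_dirs,
    }


if __name__ == "__main__":
    print(cuda_env_report(dict(os.environ)))
    # shutil.which returns None when nvcc cannot be found on PATH
    print("nvcc found at:", shutil.which("nvcc"))
```

If both values are True but nvcc is still not found, the toolkit is probably installed somewhere other than /usr/local/cuda.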

Verify GPU acceleration works

Install PyTorch with CUDA support:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

The cu121 suffix specifies CUDA 12.1 compatibility. Adjust this to match your CUDA version.
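The mapping from a CUDA version to the wheel suffix can be sketched in Python. This illustration assumes the suffix is always "cu" followed by the major and minor digits, as in the cu121 example; wheel_tag and pip_command are hypothetical helper names. PyTorch publishes wheels only for certain CUDA versions, so the generated URL is not guaranteed to exist.

```python
PYTORCH_WHEEL_BASE = "https://download.pytorch.org/whl"


def wheel_tag(cuda_version: str) -> str:
    """Convert a CUDA version such as '12.1' into a tag such as 'cu121'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def pip_command(cuda_version: str) -> str:
    """Build the pip command for the wheel index matching cuda_version."""
    return ("pip3 install torch torchvision torchaudio "
            f"--index-url {PYTORCH_WHEEL_BASE}/{wheel_tag(cuda_version)}")


if __name__ == "__main__":
    print(pip_command("12.1"))
```

Check the printed command against the one shown on the PyTorch website before running it.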

Test that PyTorch can see your GPU:

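import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA devices: {torch.cuda.device_count()}")
# get_device_name raises an error when no GPU is visible, so guard the call
if torch.cuda.is_available():
    print(f"Current device: {torch.cuda.get_device_name(0)}")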
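Seeing the device is one thing; running work on it is another. The sketch below goes a step further and performs a small matrix multiplication on the GPU when one is usable, falling back to the CPU otherwise. gpu_smoke_test is a hypothetical name, and the 256x256 size is an arbitrary choice.

```python
def gpu_smoke_test() -> str:
    """Multiply two small matrices on the GPU if one is usable.

    Returns the device type that ran the computation ("cuda" or "cpu"),
    or "torch not installed" when PyTorch cannot be imported.
    """
    try:
        import torch
    except ImportError:
        return "torch not installed"

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.randn(256, 256, device=device)
    b = torch.randn(256, 256, device=device)
    product = a @ b
    if device.type == "cuda":
        # GPU work is asynchronous; wait for the kernel to finish
        torch.cuda.synchronize()
    return product.device.type


if __name__ == "__main__":
    print("Computation ran on:", gpu_smoke_test())
```

A result of "cpu" on a machine with an NVIDIA card suggests a CPU-only PyTorch build or a driver problem.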

For TensorFlow users, install TensorFlow. Recent versions include GPU support in the standard package, so the separate tensorflow-gpu package is no longer needed:

pip3 install tensorflow

Test TensorFlow GPU support:

import tensorflow as tf
print("GPUs available:", tf.config.list_physical_devices('GPU'))

Both frameworks should detect your GPU and display device information.

Common problems and fixes

"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" means the driver didn't load properly. This usually happens after Ubuntu kernel updates. Reinstall the NVIDIA driver through Additional Drivers.

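One way to confirm that the driver module is missing after a kernel update is to look at the list of loaded kernel modules. The sketch below reads /proc/modules, where the first field of each line is the module name; loaded_nvidia_modules is a hypothetical helper, and matching on the "nvidia" prefix is an assumption about how the driver names its modules.

```python
from pathlib import Path


def loaded_nvidia_modules(proc_modules_text: str) -> list[str]:
    """Return names of loaded kernel modules that belong to the NVIDIA driver."""
    names = [line.split()[0] for line in proc_modules_text.splitlines() if line.strip()]
    return [n for n in names if n == "nvidia" or n.startswith("nvidia_")]


if __name__ == "__main__":
    modules_file = Path("/proc/modules")
    text = modules_file.read_text() if modules_file.exists() else ""
    modules = loaded_nvidia_modules(text)
    if modules:
        print("NVIDIA modules loaded:", ", ".join(modules))
    else:
        print("No NVIDIA kernel module is loaded")
```

An empty list alongside the nvidia-smi error points at the driver installation rather than at CUDA or the framework.
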
CUDA version mismatches can surface as "RuntimeError: CUDA runtime error" messages. Check the installed toolkit version with nvcc --version, compare it with the version PyTorch reports in torch.version.cuda, and reinstall PyTorch from the matching wheel index if they differ. The PyTorch website provides installation commands for specific CUDA versions.
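The version comparison can be automated. This sketch extracts the release number from the text that nvcc --version prints and compares it with a PyTorch CUDA version string. nvcc_release and versions_match are hypothetical names, and requiring an exact major.minor match is the strictest reading of "matches"; looser pairings may also work in practice.

```python
import re
from typing import Optional


def nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from nvcc --version output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def versions_match(toolkit: str, torch_cuda: str) -> bool:
    """Compare major.minor of the toolkit with PyTorch's CUDA version."""
    return toolkit.split(".")[:2] == torch_cuda.split(".")[:2]


if __name__ == "__main__":
    # Sample text in the format nvcc prints; replace with real output.
    sample = "Cuda compilation tools, release 12.1, V12.1.105"
    release = nvcc_release(sample)
    print("Toolkit release:", release)
    print("Matches PyTorch 12.1 build:", versions_match(release, "12.1"))
```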

Out of memory errors happen when your GPU runs out of VRAM during training. Reduce batch size, use gradient checkpointing, or enable mixed precision training. Monitor memory usage with nvidia-smi while training.
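Reducing the batch size can be done by trial and error in code. The sketch below halves the batch size until a training step succeeds. It is framework-agnostic: train_step is a hypothetical callable you would supply, and detecting out-of-memory failures by looking for "out of memory" in the error message is an assumption based on how PyTorch words its CUDA errors.

```python
from typing import Callable


def is_oom(error: BaseException) -> bool:
    """Heuristic: treat MemoryError or an 'out of memory' message as OOM."""
    return isinstance(error, MemoryError) or "out of memory" in str(error).lower()


def largest_fitting_batch(train_step: Callable[[int], None], start: int) -> int:
    """Halve the batch size until train_step(batch_size) runs without OOM."""
    batch_size = start
    while batch_size >= 1:
        try:
            train_step(batch_size)
            return batch_size
        except (RuntimeError, MemoryError) as error:
            if not is_oom(error):
                raise  # unrelated failure: do not hide it
            batch_size //= 2
    raise RuntimeError("even batch size 1 does not fit in memory")


if __name__ == "__main__":
    def fake_step(batch_size: int) -> None:
        # Stand-in for a real training step; pretends 24 samples is the limit.
        if batch_size > 24:
            raise RuntimeError("CUDA out of memory")

    print(largest_fitting_batch(fake_step, 128))
```

With a real model, it may also help to release cached GPU memory between attempts, for example with torch.cuda.empty_cache().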

Errors about a missing cuDNN library mean an additional NVIDIA library needs to be installed. Download cuDNN from NVIDIA's website (requires a free account) and extract it into your CUDA directory.
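After extracting cuDNN, you can check that its shared libraries landed where expected. This sketch lists files matching libcudnn*.so* in a directory; find_cudnn_libraries is a hypothetical helper, and /usr/local/cuda/lib64 is assumed from the paths used earlier in this article.

```python
from pathlib import Path


def find_cudnn_libraries(lib_dir: str) -> list[str]:
    """Return the sorted names of cuDNN shared libraries found in lib_dir."""
    directory = Path(lib_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("libcudnn*.so*"))


if __name__ == "__main__":
    found = find_cudnn_libraries("/usr/local/cuda/lib64")
    print("cuDNN libraries:", found if found else "none found")
```

An empty list means the archive was extracted elsewhere or cuDNN was installed through a package that uses a different directory.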

GPU acceleration can cut training runs from hours to minutes. Once it is working, you'll wonder how anyone does machine learning without it.