How KeepGPU Works

At runtime, KeepGPU spins up one lightweight worker per GPU. Each worker keeps a tensor allocated and runs a burst of CUDA ops, then sleeps. This convinces most schedulers that the GPU is still busy, without burning a full training workload.

Components

  1. CLI (Typer/Rich) – Parses options, validates GPU IDs, and configures the logger.
  2. GlobalGPUController – Detects the current platform (CUDA, ROCm, or Mac M series) and instantiates one single-GPU controller per selected device.
  3. CudaGPUController / RocmGPUController / MacMGPUController – Platform-specific implementations for per-GPU keep-alive loops.
  4. GPU monitor (NVML/ROCm) – Wraps nvidia-ml-py (the pynvml module) for CUDA telemetry and, optionally, rocm-smi when it is installed via the rocm extra. Mac M series does not support direct GPU utilization monitoring.
  5. Utilities – parse_size turns strings like 1GiB into bytes, while setup_logger wires both console and file logging with optional colors.

CLI args ──▶ GlobalGPUController ──▶ [CudaGPUController rank=0]
                                      [CudaGPUController rank=1]
                                      [...]
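
To make the parse_size utility listed above concrete, here is a minimal sketch of how such a helper could work. It is not KeepGPU's actual implementation: the unit table, the decimal-versus-binary distinction (GB vs GiB), and the error messages are all assumptions made for illustration.

```python
import re

# Hypothetical unit table; the real parse_size may accept a different set
# of suffixes or treat "GB" as binary rather than decimal.
_UNITS = {
    "B": 1,
    "KB": 1000, "MB": 1000**2, "GB": 1000**3,
    "KIB": 1024, "MIB": 1024**2, "GIB": 1024**3,
}


def parse_size(text: str) -> int:
    """Convert a string such as '1GiB' or '512 MiB' into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    try:
        factor = _UNITS[unit.upper()]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None
    # Truncate fractional bytes so the result is always an int.
    return int(float(number) * factor)
```

In this sketch, parse_size("1GiB") yields 1073741824 bytes, and an unrecognized suffix raises ValueError rather than silently falling back to a default.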

Lifecycle

  1. The CLI (or your Python code) instantiates GlobalGPUController.
  2. During keep() / __enter__, each CUDA worker:
     • Allocates a tensor sized via vram_to_keep.
     • Starts a daemon thread that performs matmul_iterations fused activations.
     • Calls _monitor_utilization (via NVML) to detect real activity.
     • If utilization exceeds busy_threshold, sleeps for one more interval; otherwise, runs a new batch of ops.
  3. When you call release() (or exit the context), every worker sets a stop event, joins the thread, and clears the CUDA cache.
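
The lifecycle above can be sketched in plain Python. This is an illustrative stand-in, not KeepGPU's controller: the class name KeepAliveWorker and the read_utilization and run_batch callables are hypothetical, and the tensor allocation and CUDA cache clearing are left out so the example runs without a GPU. Only the threading structure (stop event, daemon thread, threshold check, join on release) follows the description.

```python
import threading


class KeepAliveWorker:
    """Illustrative single-GPU keep-alive loop (not KeepGPU's real class)."""

    def __init__(self, read_utilization, run_batch,
                 interval=1.0, busy_threshold=10):
        # Hypothetical hooks: read_utilization() returns a percentage,
        # run_batch() performs one burst of GPU work.
        self._read_utilization = read_utilization
        self._run_batch = run_batch
        self._interval = interval
        self._busy_threshold = busy_threshold
        self._stop = threading.Event()
        self._thread = None

    def keep(self):
        """Start the background loop on a daemon thread."""
        self._stop.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while not self._stop.is_set():
            # Only add load when the GPU is not already busy.
            if self._read_utilization() <= self._busy_threshold:
                self._run_batch()
            # Event.wait acts as a sleep that release() can interrupt.
            self._stop.wait(self._interval)

    def release(self):
        """Signal the loop to stop and wait for the thread to finish."""
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def __enter__(self):
        self.keep()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
```

Waiting on the stop event instead of calling time.sleep means release() does not have to wait out a full interval before the thread can be joined.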

Why matmuls?

Matrix multiplies:

  • Allocate contiguous VRAM quickly, which is what schedulers monitor.
  • Exercise compute units enough to show non-zero utilization spikes.
  • Are deterministic and easy to tune: adjust matmul_iterations to trade power draw for a stronger “busy” signal.

Threading & responsiveness

  • The keep-alive loop runs on daemon threads so the main process can exit fast.
  • GlobalGPUController.release() stops workers concurrently using threads, so shutdown time stays bounded even with many GPUs.
  • Errors inside a worker are logged but do not bring the whole process down; the loop retries after clearing the CUDA cache.
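
A rough sketch of the last two points, under stated assumptions: release_all and run_batch_safely are hypothetical helper names, the workers are any objects with a release() method, and clear_cache stands in for whatever cache-clearing call the platform provides. It shows the general pattern of parallel shutdown and log-and-retry error handling, not KeepGPU's own code.

```python
import logging
import threading

logger = logging.getLogger("keepalive-sketch")


def run_batch_safely(run_batch, clear_cache):
    """Run one batch; on failure, log it, clear the cache, and return False.

    The caller's loop keeps going, so the next interval acts as the retry.
    """
    try:
        run_batch()
        return True
    except Exception:
        logger.exception("keep-alive batch failed; retrying next interval")
        clear_cache()
        return False


def release_all(workers):
    """Call release() on every worker in parallel and wait for all of them."""
    threads = [threading.Thread(target=w.release) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With this pattern the total shutdown time is roughly that of the slowest worker rather than the sum over all workers.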

Platform detection

get_platform() inspects the system and enables the CUDA, ROCm, or Mac M series (MPS) path. Detection order: CUDA → ROCm → Mac M → CPU fallback.
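
The detection order can be expressed as a simple first-match search. The sketch below is an assumption about the shape of such a function, not the body of get_platform(): the Platform enum, the detect_platform name, and the probe callables are all hypothetical, and how each backend is actually probed is left to the caller.

```python
from enum import Enum


class Platform(Enum):
    CUDA = "cuda"
    ROCM = "rocm"
    MACM = "macm"
    CPU = "cpu"


# Order taken from the text: CUDA, then ROCm, then Mac M, then CPU fallback.
_DETECTION_ORDER = (Platform.CUDA, Platform.ROCM, Platform.MACM)


def detect_platform(probes):
    """Return the first platform whose probe succeeds, else Platform.CPU.

    probes maps a Platform to a zero-argument callable returning bool.
    A missing or failing probe counts as "not available".
    """
    for platform in _DETECTION_ORDER:
        probe = probes.get(platform)
        if probe is None:
            continue
        try:
            if probe():
                return platform
        except Exception:
            # A probe that raises (e.g. a missing library) is treated as absent.
            continue
    return Platform.CPU
```

Passing the probes in as callables keeps the ordering logic testable on machines that have no GPU at all.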