How KeepGPU Works
At runtime, KeepGPU spins up one lightweight worker per GPU. Each worker keeps a tensor allocated and runs a burst of CUDA ops, then sleeps. This convinces most schedulers that the GPU is still busy, without burning a full training workload.
Components
- CLI (Typer/Rich) – Parses options, validates GPU IDs, and configures the logger.
- `GlobalGPUController` – Detects the current platform (CUDA, ROCm, or Mac M series) and instantiates one single-GPU controller per selected device.
- `CudaGPUController` / `RocmGPUController` / `MacMGPUController` – Platform-specific implementations for per-GPU keep-alive loops.
- GPU monitor (NVML/ROCm) – Wraps `nvidia-ml-py` (the `pynvml` module) for CUDA telemetry and optionally `rocm-smi` when installed through the `rocm` extra. Mac M series does not support direct GPU utilization monitoring.
- Utilities – `parse_size` turns strings like `1GiB` into bytes, while `setup_logger` wires both console and file logging with optional colors.
CLI args ──▶ GlobalGPUController ──▶ [CudaGPUController rank=0]
                                     [CudaGPUController rank=1]
                                     [...]
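Of the components above, the size parser is the easiest to illustrate in isolation. The following is a minimal sketch of how a string such as `1GiB` could be turned into bytes. The name `parse_size_sketch`, the accepted unit spellings, and the error handling are assumptions made for illustration; the real `parse_size` may behave differently.

```python
import re

# Decimal (SI) and binary (IEC) multipliers. Which spellings the real
# parse_size accepts is an assumption made for this sketch.
_UNITS = {
    "B": 1,
    "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,
    "KIB": 1024, "MIB": 1024**2, "GIB": 1024**3, "TIB": 1024**4,
}

_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*$")


def parse_size_sketch(text: str) -> int:
    """Convert a human-readable size such as '1GiB' into a byte count."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    unit = unit.upper() or "B"  # a bare number is treated as bytes
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {text!r}")
    return int(float(number) * _UNITS[unit])


print(parse_size_sketch("1GiB"))    # 1073741824
print(parse_size_sketch("512MiB"))  # 536870912
```

Returning a plain integer keeps the result directly usable as an allocation size, whatever unit the user typed.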
Lifecycle
- The CLI (or your Python code) instantiates `GlobalGPUController`.
- During `keep()`/`__enter__`, each CUDA worker:
  - Allocates a tensor sized by `vram_to_keep`.
  - Starts a daemon thread that performs `matmul_iterations` fused activations.
  - Calls `_monitor_utilization` (through NVML) to detect real activity.
  - If utilization exceeds `busy_threshold`, the worker just sleeps for one more `interval`. Otherwise it runs a new batch of ops.
- When you call `release()` (or exit the context), every worker sets a stop event, joins the thread, and clears the CUDA cache.
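The lifecycle above can be sketched without a GPU by injecting the two device-specific pieces as plain callables. `KeepAliveWorkerSketch`, `read_utilization`, and `run_burst` are hypothetical names for this illustration, not KeepGPU's actual classes or signatures. The tensor allocation and cache clearing are marked only as comments.

```python
import threading
import time
from typing import Callable, Optional


class KeepAliveWorkerSketch:
    """Hypothetical single-device worker; not KeepGPU's actual class."""

    def __init__(
        self,
        read_utilization: Callable[[], float],  # stand-in for the NVML query
        run_burst: Callable[[], None],          # stand-in for the batch of GPU ops
        interval: float = 0.01,
        busy_threshold: float = 10.0,
    ) -> None:
        self.read_utilization = read_utilization
        self.run_burst = run_burst
        self.interval = interval
        self.busy_threshold = busy_threshold
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def keep(self) -> None:
        # A real worker would allocate its vram_to_keep tensor here.
        self._stop.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self) -> None:
        while not self._stop.is_set():
            # Only add load when nothing else is keeping the device busy.
            if self.read_utilization() <= self.busy_threshold:
                self.run_burst()
            # Event.wait acts as a sleep that release() can interrupt.
            self._stop.wait(self.interval)

    def release(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None
        # A real worker would drop the tensor and clear the CUDA cache here.

    def __enter__(self) -> "KeepAliveWorkerSketch":
        self.keep()
        return self

    def __exit__(self, *exc_info) -> None:
        self.release()


bursts = []
with KeepAliveWorkerSketch(lambda: 0.0, lambda: bursts.append(1)):
    time.sleep(0.05)  # the "real" work of the main program would go here
print(f"ran {len(bursts)} bursts while the device looked idle")
```

Waiting on the stop event instead of calling `time.sleep` is one way to keep `release()` prompt: the worker wakes as soon as the event is set rather than finishing a full `interval`.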
Why matmuls?
Matrix multiplies:
- Allocate contiguous VRAM quickly, which is what schedulers monitor.
- Exercise compute units enough to show non-zero utilization spikes.
- Are deterministic and easy to tune: adjust `matmul_iterations` to trade power draw for stronger “busy” signals.
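To get a feel for that trade-off, here is a rough back-of-the-envelope estimate. It assumes square matrices, a fixed sustained throughput, and one sleep of `interval` per burst. The function name and every number in the loop are illustrative assumptions, not measurements or KeepGPU defaults.

```python
def burst_duty_cycle(
    matmul_iterations: int,
    matrix_dim: int,
    interval_s: float,
    sustained_flops: float,
) -> float:
    """Estimate the fraction of each burst-plus-sleep cycle spent computing."""
    # Multiplying two n x n matrices costs about 2 * n**3 floating-point operations.
    flops_per_burst = matmul_iterations * 2 * matrix_dim**3
    burst_s = flops_per_burst / sustained_flops
    return burst_s / (burst_s + interval_s)


# Illustrative inputs only: 4096-wide matrices, a 1 s sleep, 10 TFLOP/s sustained.
for iterations in (10, 100, 1000):
    duty = burst_duty_cycle(iterations, 4096, 1.0, 10e12)
    print(f"{iterations:>5} iterations -> {duty:.1%} of each cycle busy")
```

Under these assumptions, raising `matmul_iterations` lengthens each burst while the sleep stays fixed, so both the reported utilization and the power draw rise together.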
Threading & responsiveness
- The keep-alive loop runs on daemon threads so the main process can exit fast.
- `GlobalGPUController.release()` stops workers concurrently using threads, keeping shutdown time bounded even with many GPUs.
- Errors inside a worker are logged but do not bring the whole process down; the loop retries after clearing the CUDA cache.
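The effect of releasing workers concurrently can be shown with a small sketch. `SlowWorker` and `release_all` are hypothetical stand-ins: each fake worker takes 0.2 s to stop, which simulates joining a thread and cleaning up.

```python
import threading
import time


class SlowWorker:
    """Hypothetical worker whose release() blocks while it winds down."""

    def __init__(self, rank: int, stop_delay: float) -> None:
        self.rank = rank
        self.stop_delay = stop_delay
        self.released = False

    def release(self) -> None:
        time.sleep(self.stop_delay)  # stands in for join() plus cache cleanup
        self.released = True


def release_all(workers) -> None:
    """Release every worker in parallel, one helper thread per worker."""
    threads = [threading.Thread(target=w.release) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


workers = [SlowWorker(rank, stop_delay=0.2) for rank in range(4)]
start = time.perf_counter()
release_all(workers)
elapsed = time.perf_counter() - start
print(f"released {len(workers)} workers in {elapsed:.2f}s")
```

Releasing the same four workers one after another would take about 0.8 s. With one thread per worker, the total stays close to the slowest single release.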
Platform detection
get_platform() inspects the system and enables the CUDA, ROCm, or Mac M series
(MPS) path. Detection order: CUDA → ROCm → Mac M → CPU fallback.
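The first-match ordering can be expressed as a list of probes tried in sequence. This is a sketch under stated assumptions, not the body of `get_platform()`: the function name, the platform labels, and the choice to treat a crashing probe as "not available" are all hypothetical. In practice each probe would wrap a real backend check.

```python
from typing import Callable, Sequence, Tuple

Probe = Tuple[str, Callable[[], bool]]


def detect_platform_sketch(probes: Sequence[Probe]) -> str:
    """Return the name of the first probe that reports success, else 'cpu'."""
    for name, is_available in probes:
        try:
            if is_available():
                return name
        except Exception:
            # Assumption: a probe that raises (for example because a driver
            # library is missing) counts as "not available".
            continue
    return "cpu"


def broken_probe() -> bool:
    raise RuntimeError("driver not installed")


# Order matters: CUDA first, then ROCm, then Mac M, with CPU as the fallback.
probes = [
    ("cuda", broken_probe),
    ("rocm", lambda: False),
    ("macm", lambda: True),
]
print(detect_platform_sketch(probes))  # macm
```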