Getting Started

This page helps you install KeepGPU, confirm the CLI can see your hardware, and understand the minimum knobs you need to keep a GPU occupied.

Requirements

Python 3.9 through 3.13 (matching the version in your environment/cluster image).
A PyTorch build that matches your target platform: CUDA, ROCm/HIP, Mac M series MPS, or CPU-only for import and docs workflows.
CUDA utilization monitoring uses NVML by way of nvidia-ml-py (the pynvml module); nvidia-smi is useful only as an external driver sanity check. ROCm telemetry uses rocm_smi when the ROCm/system stack provides it, and KeepGPU handles a missing module gracefully.
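As a rough illustration of telemetry that degrades gracefully, the sketch below reads CUDA utilization through pynvml and returns None when the module or the driver is missing. The function name read_cuda_utilization is hypothetical and this is not KeepGPU's own code; only the pynvml calls are real. Note that NVML indexes physical devices, which may differ from the visible ordinals PyTorch reports.

```python
from typing import Optional


def read_cuda_utilization(index: int = 0) -> Optional[int]:
    """Return GPU utilization in percent, or None when NVML is unusable.

    Hypothetical helper for illustration; not part of KeepGPU.
    """
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        return None  # telemetry module missing: report "unknown"

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no driver or no NVIDIA device on this host

    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return int(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    except pynvml.NVMLError:
        return None  # bad index or a query the device does not support
    finally:
        pynvml.nvmlShutdown()


print(read_cuda_utilization(0))
```

A caller can treat None as "utilization unknown" and decide separately whether to back off or proceed.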

Platforms

CUDA is the most common path; ROCm requires a ROCm-enabled PyTorch build. The rocm extra remains available as an install-compatible marker but does not install ROCm SMI from PyPI. Mac M series (M1/M2/M3/M4) is supported by way of the macm extra using Metal Performance Shaders (MPS). CPU-only environments can import the package but controllers will not start.

Install

Stable release (PyPI):

pip install keep-gpu

CUDA (example: cu121):

pip install --index-url https://download.pytorch.org/whl/cu121 torch
pip install keep-gpu

ROCm (example: rocm6.1):

pip install --index-url https://download.pytorch.org/whl/rocm6.1 torch
pip install keep-gpu[rocm]

Install the ROCm-compatible PyTorch build that matches your runtime. ROCm SMI comes from the ROCm/system stack, not from the rocm extra.

CPU-only:

pip install torch
pip install keep-gpu

Editable dev install:

git clone https://github.com/Wangmerlyn/KeepGPU.git
cd KeepGPU
pip install -e .[dev]

Mac M series (M1/M2/M3/M4):

pip install torch
pip install keep-gpu[macm]

The MPS (Metal Performance Shaders) backend is used automatically on Apple Silicon Macs. Note: GPU utilization monitoring is not available on macOS.

Pick your interface

CLI – fastest way to start keepalive sessions from a shell; see CLI Playbook.
Python module – embed keep-alive loops inside orchestration code; see Python API Recipes.
MCP server – expose KeepGPU over JSON-RPC (stdio or HTTP) for agents; see MCP Server.

Sanity check

Make sure PyTorch can see at least one device:

python -c "import torch; print(torch.cuda.device_count())"

A non-zero integer indicates that at least one CUDA or ROCm/HIP device is visible to PyTorch.

On Mac M series:

python -c "import torch; print(torch.backends.mps.is_available())"

This should print True on Apple Silicon Macs.
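The two one-liners above can be folded into a single check. The sketch below classifies the backend from any PyTorch-like module; describe_backend is a hypothetical helper name, and the stub object only stands in for torch so the example runs without PyTorch installed. It relies on ROCm builds of PyTorch reusing the torch.cuda namespace and setting torch.version.hip.

```python
from types import SimpleNamespace


def describe_backend(torch_module) -> str:
    """Classify the accelerator backend a PyTorch-like module exposes."""
    if torch_module.cuda.is_available():
        # ROCm builds reuse torch.cuda; torch.version.hip is set only there.
        if getattr(torch_module.version, "hip", None):
            return "rocm"
        return "cuda"
    mps = getattr(torch_module.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


# Stand-in for an Apple Silicon install, so the sketch runs without PyTorch.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False),
    version=SimpleNamespace(hip=None),
    backends=SimpleNamespace(mps=SimpleNamespace(is_available=lambda: True)),
)
print(describe_backend(fake_torch))  # prints "mps"
```

With PyTorch installed, the same function can be called as describe_backend(torch) after import torch.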

Do a short trial run of the CLI (press Ctrl+C after a few seconds):

keep-gpu --gpu-ids 0 --interval 30 --vram 512MB

This limits the sanity check to visible GPU 0; omit --gpu-ids only when you intentionally want all visible GPUs. On CUDA or ROCm, you should see Rich logs showing the GPUs being kept awake. On Mac M series, utilization telemetry is unavailable, so with the default non-negative busy_threshold KeepGPU backs off instead of allocating keep tensors or running MPS compute. Use --busy-threshold -1 only when you intentionally want unconditional Mac M keepalive work.
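The busy-threshold behavior described above can be summarized as a small decision function. This is a minimal sketch under stated assumptions, not KeepGPU's implementation: the name should_do_keepalive is hypothetical, and the choice to treat a utilization exactly equal to the threshold as "not busy" is a guess at the boundary.

```python
from typing import Optional


def should_do_keepalive(utilization: Optional[float], busy_threshold: float) -> bool:
    """Decide whether to run keepalive work on this check.

    utilization is a percentage, or None when telemetry is unavailable.
    """
    if busy_threshold < 0:
        return True  # negative threshold: unconditional keepalive
    if utilization is None:
        return False  # no telemetry and a non-negative threshold: back off
    # Assumed boundary: at or below the threshold counts as idle.
    return utilization <= busy_threshold


print(should_do_keepalive(None, 25))   # False: telemetry missing, back off
print(should_do_keepalive(None, -1))   # True: unconditional
```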

Lower-power keep-alive

KeepGPU runs lightweight elementwise ops at fixed intervals (not floods of large matmuls), so schedulers see activity while power draw and thermals stay modest.
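To make the idea concrete, here is a minimal sketch of an interval-spaced loop with a cheap elementwise workload. Everything in it is an assumption for illustration: run_keepalive and light_step are hypothetical names, the tensor size is arbitrary, and KeepGPU's real controllers may be structured differently.

```python
import threading
from typing import Callable, Optional


def run_keepalive(step: Callable[[], None], interval: float,
                  stop: threading.Event, max_steps: Optional[int] = None) -> int:
    """Call step() once per interval until stop is set; return the step count."""
    done = 0
    while not stop.is_set() and (max_steps is None or done < max_steps):
        step()
        done += 1
        stop.wait(interval)  # sleeps, but wakes immediately when stop is set
    return done


def light_step() -> None:
    """Hypothetical cheap workload: one elementwise op on a small tensor."""
    import torch  # imported lazily so the loop itself has no torch dependency

    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.ones(1024, device=device)
    x.mul_(1.0001)  # elementwise, far cheaper than a large matmul
```

Calling run_keepalive(light_step, 30, threading.Event()) from a background thread and setting the event later would stop the loop without waiting out the full interval.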

Your first keep-alive loop

keep-gpu --interval 120 --gpu-ids 0 --vram 1GiB

--interval controls the finite positive sleep between utilization checks (seconds), including fractional values such as 0.5. Values above the Python runtime wait limit are rejected.
--gpu-ids limits the job to a subset of visible device ordinals. Set CUDA_VISIBLE_DEVICES on CUDA, or ROCR_VISIBLE_DEVICES with a matching HIP_VISIBLE_DEVICES/CUDA_VISIBLE_DEVICES overlay on ROCm, before starting KeepGPU if you need physical-device filtering. In service mode, keep-gpu list-gpus and the dashboard show these same start-compatible non-negative, unique visible ordinals as GPU IDs; physical metadata is only informational.
--vram accepts human-readable sizes or bare bytes; KeepGPU rounds the internal float32 tensor element count up to cover the requested amount. Byte-equivalent values below 4 bytes or above 1 PiB are rejected.
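The --interval and --vram rules above can be sketched as two validators. This is an illustration under assumptions, not KeepGPU's parser: the function names are hypothetical, the unit table (decimal for MB/GB, binary for MiB/GiB) is a guess at what "human-readable sizes" covers, and threading.TIMEOUT_MAX is assumed to be the "Python runtime wait limit" the text refers to.

```python
import math
import re
import threading

# Assumed unit table: decimal for KB/MB/GB/TB, binary for KiB/MiB/GiB/TiB.
_UNITS = {
    "": 1, "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40,
}
_MIN_BYTES = 4       # one float32 element
_MAX_BYTES = 2**50   # 1 PiB


def validate_interval(seconds: float) -> float:
    """Accept finite positive intervals that a timed wait can represent."""
    value = float(seconds)
    if not math.isfinite(value) or value <= 0:
        raise ValueError("interval must be a finite positive number of seconds")
    if value > threading.TIMEOUT_MAX:
        raise ValueError("interval exceeds the Python runtime wait limit")
    return value


def vram_to_float32_elements(size: str) -> int:
    """Convert a size such as '1GiB' or bare bytes into a float32 element count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*", size)
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unrecognized size: {size!r}")
    n_bytes = math.ceil(float(match.group(1)) * _UNITS[match.group(2).upper()])
    if not _MIN_BYTES <= n_bytes <= _MAX_BYTES:
        raise ValueError("size must be between 4 bytes and 1 PiB")
    return (n_bytes + 3) // 4  # round up so the tensor covers the request


print(vram_to_float32_elements("1GiB"))  # 268435456
print(vram_to_float32_elements("5"))     # 2 (5 bytes rounds up to two elements)
```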

Leave the command running while you prepare data or review notebooks. When you are ready to hand the GPU back, press Ctrl+C; the controllers release VRAM and exit. On Mac M series, add --busy-threshold -1 only when you intentionally want unconditional MPS keepalive work despite unavailable utilization telemetry.

Non-blocking workflow for agents

Use service mode when you need the terminal for follow-up commands:

keep-gpu start --gpu-ids 0 --interval 120 --vram 1GiB --busy-threshold 25
keep-gpu status
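An agent or orchestration script could drive the same two commands from Python. The sketch below is an assumption-laden wrapper, not an official API: build_start_argv and start_and_report are hypothetical names, only the flags shown above are used, a single GPU ID is passed because the multi-ID syntax is not shown here, and the status output is returned as raw text because its format is not specified on this page.

```python
import shutil
import subprocess
from typing import List


def build_start_argv(gpu_id: int, interval: float, vram: str,
                     busy_threshold: int) -> List[str]:
    """Build the argument list for `keep-gpu start` with the flags shown above."""
    return [
        "keep-gpu", "start",
        "--gpu-ids", str(gpu_id),
        "--interval", str(interval),
        "--vram", vram,
        "--busy-threshold", str(busy_threshold),
    ]


def start_and_report(argv: List[str]) -> str:
    """Start a session, then return whatever `keep-gpu status` prints."""
    if shutil.which("keep-gpu") is None:
        return "keep-gpu is not on PATH"
    subprocess.run(argv, check=True)  # assumes `start` returns once the session is registered
    status = subprocess.run(["keep-gpu", "status"], check=True,
                            capture_output=True, text=True)
    return status.stdout


print(build_start_argv(0, 120, "1GiB", 25))
```

Passing an argument list rather than a shell string avoids quoting problems when values come from user input.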

To inspect and control sessions in a browser, start the service and then open the dashboard URL:

keep-gpu serve --host 127.0.0.1 --port 8765

http://127.0.0.1:8765/

KeepGPU inside Python

Prefer code-level control? Import the controllers directly (full recipes in Python API Recipes):

from keep_gpu.single_gpu_controller.cuda_gpu_controller import CudaGPUController

with CudaGPUController(rank=0, interval=0.5, vram_to_keep="1GiB"):
    preprocess_dataset()   # Keepalive session stays active

train_model()              # GPU freed upon exiting the context

Direct CUDA/ROCm rank values are visible ordinals in the current process and are rejected during construction if they are non-integer, negative, or outside the visible device count.
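The rank rule above amounts to a small check at construction time. The sketch below is illustrative only: validate_rank is a hypothetical name, and both the exception types and the decision to reject bool (which is a subclass of int in Python) are this sketch's choices, since the text says only that such values are rejected.

```python
def validate_rank(rank: object, visible_count: int) -> int:
    """Return rank if it is a valid visible device ordinal, else raise."""
    # bool is a subclass of int; rejecting it here is an assumption of this sketch.
    if isinstance(rank, bool) or not isinstance(rank, int):
        raise TypeError(f"rank must be an integer, got {type(rank).__name__}")
    if rank < 0 or rank >= visible_count:
        raise ValueError(f"rank {rank} is outside the {visible_count} visible device(s)")
    return rank


print(validate_rank(0, 2))  # 0
```

In a real process, visible_count would come from something like torch.cuda.device_count(), which already reflects any visibility mask.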

Prefer to manage multiple devices at once?

from keep_gpu.global_gpu_controller.global_gpu_controller import GlobalGPUController

with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=60):
    run_cpu_bound_stage()

Those gpu_ids are visible device ordinals after CUDA or ROCm visibility filtering. CUDA telemetry resolves CUDA_VISIBLE_DEVICES and treats duplicate or ambiguous masks as unavailable telemetry. CUDA UUID prefixes are accepted when unique, and parsing stops at -1 after any valid preceding tokens. ROCm telemetry resolves ROCR_VISIBLE_DEVICES and one matching HIP_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES overlay before querying vendor utilization counters.
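To illustrate how a CUDA_VISIBLE_DEVICES mask maps visible ordinals to physical indices, here is a simplified resolver. It is a sketch under assumptions, not KeepGPU's resolver: resolve_cuda_mask is a hypothetical name, UUID tokens are not resolved (they are reported as ambiguous because the sketch has no device list to match against), and out-of-range indices are treated the same way as duplicates.

```python
from typing import List, Optional


def resolve_cuda_mask(mask: Optional[str], device_count: int) -> Optional[List[int]]:
    """Map visible ordinals to physical indices, or None if the mask is ambiguous.

    The returned list is indexed by visible ordinal: result[0] is the
    physical index of visible GPU 0.
    """
    if mask is None:
        return list(range(device_count))  # variable unset: every device is visible
    if mask.strip() == "":
        return []  # empty mask: no devices are visible
    visible: List[int] = []
    for token in mask.split(","):
        token = token.strip()
        if token == "-1":
            break  # parsing stops here; earlier valid tokens stay visible
        if not (token.isascii() and token.isdigit()):
            return None  # UUIDs and malformed tokens are out of scope for this sketch
        index = int(token)
        if index >= device_count or index in visible:
            return None  # out-of-range or duplicate: treat telemetry as unavailable
        visible.append(index)
    return visible


print(resolve_cuda_mask("2,0,-1,3", 4))  # [2, 0]
print(resolve_cuda_mask("0,0", 4))       # None
```

With a mask of "2,0", a gpu_ids value of [0] refers to physical device 2, which is why telemetry has to resolve the mask before querying utilization counters.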

From here, jump to the CLI Playbook for scenario-driven guidance or the API recipes if you need to embed KeepGPU in orchestration scripts.

For contributors

Developing locally? See Contributing for dev install, test commands (including CUDA/ROCm markers), and PR tips.