Srunx

MCP server for the SLURM workload manager. Submit jobs, run YAML workflows, monitor GPU resources, manage SSH profiles, and sync files to remote HPC clusters from natural language. 14 tools spanning local SLURM and SSH-remote clusters; companion CLI and FastAPI Web UI ship in th…

code-execution · api
By ksterx
132 · Updated 1 month ago · Python · Apache-2.0

Installation

npx -y srunx

Configuration

{
  "mcpServers": {
    "srunx": {
      "command": "npx",
      "args": ["-y", "srunx"]
    }
  }
}

How to use

  1. Run the installation command above (if needed)
  2. Open your Claude Code settings file (~/.claude/settings.json)
  3. Add the configuration to the mcpServers section
  4. Restart Claude Code to apply changes
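
Steps 2 and 3 can also be scripted. The following is a minimal sketch, assuming the settings file is plain JSON with an optional top-level `mcpServers` object. The helper names `add_mcp_server` and `update_settings_file` are hypothetical and are not part of srunx or Claude Code.

```python
import json
from pathlib import Path


def add_mcp_server(settings: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of `settings` with one entry added under "mcpServers".

    Hypothetical helper: it only merges JSON and does not validate the command.
    """
    merged = dict(settings)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


def update_settings_file(path: Path) -> None:
    """Read the settings file (if it exists), add the srunx entry, write it back."""
    current = json.loads(path.read_text()) if path.exists() else {}
    updated = add_mcp_server(current, "srunx", "npx", ["-y", "srunx"])
    path.write_text(json.dumps(updated, indent=2) + "\n")
```

Calling `update_settings_file(Path.home() / ".claude" / "settings.json")` would apply the configuration shown above while leaving other keys untouched. Back up the file first, since the sketch rewrites it in place.
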
<div align="center"> <img src="https://raw.githubusercontent.com/ksterx/srunx/main/docs/assets/images/icon.svg" width="120" alt="srunx logo">

srunx

A unified CLI, web dashboard, and Python API for SLURM job management.

Stop juggling sbatch scripts, squeue loops, and SSH sessions.


</div> <div align="center"> <img src="https://raw.githubusercontent.com/ksterx/srunx/main/docs/assets/images/ui-dashboard.png" width="800" alt="srunx web dashboard"> </div>
  • Submit & manage SLURM jobs from CLI, browser, or Python
  • Orchestrate multi-step workflows with YAML and dependency graphs
  • Monitor GPU availability and job states with Slack notifications
  • Local or remote, one CLI — target a local SLURM or any SSH'd cluster with --profile <name>; no shell-in, no separate "remote" commands — the same verbs you already know
  • Container-native — Pyxis, Apptainer, and Singularity support built in

Installation

Requires Python 3.12+ and access to a SLURM cluster (local or via SSH).

uv add srunx             # with uv (recommended)
pip install srunx        # or with pip

The web dashboard and Slack notifications are included in the base install — no extras required.

For AI agent integration (MCP server), add the mcp extra:

uv add "srunx[mcp]"

Quick Start

Submit a job, wait for it, and view the logs — end to end:

# 1. Submit (use -- to separate srunx flags from the command)
$ srunx sbatch --name training --gpus-per-node 2 --conda ml_env -- python train.py
✅ Submitted job training (id=847291)

# 2. Follow until completion
$ srunx watch jobs 847291
⠋ 847291 training  PENDING  →  RUNNING  →  COMPLETED (4m 12s)

# 3. Inspect output
$ srunx tail 847291 -n 20

Or describe the whole pipeline once and let srunx drive it:

srunx flow run workflow.yaml

Same commands, remote cluster

Every command above accepts --profile <name> and dispatches transparently over SSH — same syntax, same output, same feel as local:

srunx sbatch --profile dgx --name training --gpus-per-node 2 --conda ml_env -- python train.py
srunx squeue --profile dgx
srunx tail   --profile dgx 847291 --follow
srunx flow run pipeline.yaml --profile dgx

srunx rsyncs your code under a per-mount lock, runs sbatch in place on the remote, and streams logs back. Your shell never leaves the laptop.

Why srunx?

Instead of stitching together sbatch, squeue, SSH, and a pipeline runner, srunx offers one coherent surface that covers the day-to-day SLURM loop.

Capability | srunx | submitit | simple-slurm | Snakemake
CLI for submit / status / cancel (⚠️ partial)
Python API
Web dashboard
Workflow DAG with dependencies
Inter-job value passing, load-time (⚠️ via files)
Matrix parameter sweeps (⚠️ manual, ⚠️ via wildcards)
GPU availability monitoring
SSH remote submit + file sync
Container support: Pyxis / Apptainer / Singularity (⚠️ limited, ⚠️ via rules)
Slack notifications (⚠️ plugin)

If you need full-featured scientific workflow tooling, Snakemake / Nextflow are still the right call. srunx targets the sweet spot of "SLURM + a few dependencies + a nice UI" without Airflow-scale infrastructure.

CLI

Every command below runs locally or against a remote cluster over SSH. Add --profile <name> (or set $SRUNX_SSH_PROFILE) and sbatch / squeue / sinfo / sacct / history / gpus / tail / watch / flow run transparently dispatch through the SSH adapter — no shell-in first, no separate "remote" subcommand. srunx ssh is just for managing those profiles (add / list / sync / test); it does not run jobs itself.

Type column: SLURM = mirrors the native SLURM CLI (muscle memory maps directly); srunx = srunx-original command with no direct SLURM counterpart.

Job submission & control (SLURM parity)

| Command | Type | Description |
| --- | --- | --- |
| srunx sbatch <script> / srunx sbatch --wrap "<cmd>" | SLURM | Submit a SLURM job |
| srunx scancel <id> | SLURM | Cancel a job |

Status & accounting

| Command | Type | Description |
| --- | --- | --- |
| srunx squeue | SLURM | List active jobs (use -j <id> for a single job's state) |
| srunx sinfo | SLURM | Partition / state / nodelist listing (native-sinfo parity) |
| srunx sacct | SLURM | Real SLURM sacct wrapper (cluster accounting DB) |
| srunx history | srunx | srunx's own submission history (SQLite-backed) |
| srunx gpus | srunx | GPU aggregate summary across partitions |
| srunx tail <id> | srunx | View / stream job logs |
| srunx watch jobs\|resources\|cluster | srunx | Watch for state changes / resource availability |

Workflows & sweeps

| Command | Type | Description |
| --- | --- | --- |
| srunx flow | srunx | Run / validate YAML workflows |
| srunx flow run --arg KEY=VALUE | srunx | Override workflow args from the CLI |
| srunx flow run --sweep KEY=V1,V2 --max-parallel N | srunx | Ad-hoc matrix parameter sweep |

Environment & tooling

| Command | Type | Description |
| --- | --- | --- |
| srunx ssh | srunx | Manage SSH profiles (add / list / sync / test); remote execution itself is --profile on the commands above |
| srunx config | srunx | Manage configuration |
| srunx template | srunx | Manage job templates |
| srunx ui | srunx | Launch the web dashboard |

More CLI examples: User Guide · Python-side counterparts: API Reference

Web Dashboard

A dashboard for visual cluster management. Connect to your SLURM cluster over SSH and manage jobs, workflows, and resources from a browser.

srunx ui                # -> http://127.0.0.1:8000
srunx ui --port 3000    # custom port

Jobs

Browse, search, filter, and cancel jobs.

<img src="https://raw.githubusercontent.com/ksterx/srunx/main/docs/assets/images/ui-jobs.png" width="800" alt="Jobs page">

Workflow DAG

Visualize job dependencies. Run workflows directly from the UI.

<img src="https://raw.githubusercontent.com/ksterx/srunx/main/docs/assets/images/ui-workflow-dag.png" width="800" alt="Workflow DAG visualization">

Resources

GPU and node availability per partition.

<img src="https://raw.githubusercontent.com/ksterx/srunx/main/docs/assets/images/ui-resources.png" width="800" alt="Resources page">

Explorer

Browse remote files via SSH mounts. Shell scripts can be submitted as sbatch jobs directly from the file tree.

<img src="https://raw.githubusercontent.com/ksterx/srunx/main/docs/assets/images/ui-explorer-sbatch.gif" width="800" alt="Explorer sbatch submission">

Full walkthrough: Web UI tutorial · Web UI how-to · Explorer how-to

Workflow Orchestration

Define pipelines in YAML. Jobs run as soon as their dependencies complete — independent branches execute in parallel automatically.

name: experiment
args:
  model: "bert-base-uncased"
  output_dir: "/outputs/{{ model }}"

jobs:
  - name: preprocess
    command: ["python", "preprocess.py", "--out", "{{ output_dir }}/data"]
    exports:
      DATA_PATH: "{{ output_dir }}/data/processed.parquet"

  - name: train
    command: ["python", "train.py", "--model", "{{ model }}", "--data", "{{ deps.preprocess.DATA_PATH }}"]
    depends_on: [preprocess]
    gpus_per_node: 2
    environment:
      container:
        image: nvcr.io/nvidia/pytorch:24.01-py3
        mounts:
          - /data:/data
    exports:
      MODEL_PATH: "{{ output_dir }}/models/best.pt"

  - name: evaluate
    command: ["python", "eval.py", "--model", "{{ deps.train.MODEL_PATH }}"]
    depends_on: [train]

What this shows off:

  • args with Jinja2 — reusable, parameterized pipelines ({{ model }}, {{ output_dir }})
  • Inter-job exports — parents declare exports:; children read them via {{ deps.<parent>.<key> }}, fully resolved at workflow load time (no runtime env files)
  • Containers per job — Pyxis / Apptainer / Singularity are first-class (environment.container)
  • Dependency-driven scheduling: evaluate blocks on train; parallel branches run automatically
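
To make "resolved at workflow load time" concrete, the sketch below expands the same placeholders before any job would be submitted. It is a simplified stand-in that uses a regular expression rather than Jinja2, and it does not reflect srunx's internal implementation; `render` and `lookup` are hypothetical names.

```python
import re

# Matches {{ name }} and dotted lookups such as {{ deps.preprocess.DATA_PATH }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(path: str, scope: dict):
    """Follow a dotted path through nested dictionaries."""
    value = scope
    for part in path.split("."):
        value = value[part]
    return value


def render(text: str, scope: dict) -> str:
    """Replace each {{ dotted.name }} in `text` with its value from `scope`."""
    return PLACEHOLDER.sub(lambda m: str(lookup(m.group(1), scope)), text)


# Workflow args: output_dir refers to model, so it is rendered second.
args = {"model": "bert-base-uncased"}
args["output_dir"] = render("/outputs/{{ model }}", args)

# The parent's export is rendered up front, before any job runs.
deps = {
    "preprocess": {
        "DATA_PATH": render("{{ output_dir }}/data/processed.parquet", args)
    }
}

train_template = [
    "python", "train.py",
    "--model", "{{ model }}",
    "--data", "{{ deps.preprocess.DATA_PATH }}",
]
train_cmd = [render(token, {**args, "deps": deps}) for token in train_template]
print(train_cmd)
```

Because every value is known once the YAML is loaded, the child command can be fully built without waiting for the parent job to write anything at runtime.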

Run it:

srunx flow run workflow.yaml              # execute
srunx flow run workflow.yaml --dry-run    # show plan only
srunx flow run workflow.yaml --from train # resume / partial execution
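
One way to picture an execution plan is as a topological ordering of the `depends_on` graph. The sketch below uses Python's standard `graphlib` module to group jobs into waves whose dependencies are all satisfied. It illustrates the idea only and is not srunx's scheduler, which starts each job as soon as its own dependencies complete rather than waiting for a whole wave.

```python
from graphlib import TopologicalSorter


def execution_waves(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group jobs into waves; every job in a wave has all dependencies done."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# The linear pipeline from the YAML above.
linear = execution_waves(
    {"preprocess": [], "train": ["preprocess"], "evaluate": ["train"]}
)
print(linear)

# A hypothetical diamond: two independent branches can run side by side.
diamond = execution_waves(
    {
        "preprocess": [],
        "train_a": ["preprocess"],
        "train_b": ["preprocess"],
        "report": ["train_a", "train_b"],
    }
)
print(diamond)
```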

Retry with retry: N and retry_delay: <seconds> per job.

Parameter Sweeps

Run the same workflow across a matrix of hyperparameters without copying YAML. Each cell materializes into its own sbatch submission and is tracked independently.

name: train
args:
  lr: 0.01
  seed: 1

sweep:
  matrix:
    lr: [0.001, 0.01, 0.1]
    seed: [1, 2, 3]
  fail_fast: false
  max_parallel: 4

jobs:
  - name: train
    command: ["python", "train.py", "--lr", "{{ lr }}", "--seed", "{{ seed }}"]
    gpus_per_node: 1

Run it — or declare the axes ad-hoc on the command line:

srunx flow run train.yaml                                                # YAML-declared sweep
srunx flow run --sweep lr=0.001,0.01 --max-parallel 2 train.yaml          # ad-hoc
srunx flow run --sweep lr=0.001,0.01 --max-parallel 2 --dry-run train.yaml
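
A matrix sweep is the Cartesian product of its axes, so the YAML above yields 3 × 3 = 9 cells. The following sketch shows that expansion and a parser for the `KEY=V1,V2` form. Both function names are hypothetical, and any type coercion srunx applies to CLI values is not modeled here, so parsed values stay as strings.

```python
from itertools import product


def expand_matrix(matrix: dict[str, list]) -> list[dict]:
    """Return one dict per cell of the Cartesian product of the axes."""
    keys = list(matrix)
    return [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]


def parse_sweep_flag(spec: str) -> tuple[str, list[str]]:
    """Split 'KEY=V1,V2' into ('KEY', ['V1', 'V2'])."""
    key, _, values = spec.partition("=")
    return key, values.split(",")


cells = expand_matrix({"lr": [0.001, 0.01, 0.1], "seed": [1, 2, 3]})
print(len(cells))
print(cells[0], cells[-1])

key, values = parse_sweep_flag("lr=0.001,0.01")
print(expand_matrix({key: values}))
```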

Sweeps are a first-class concept across CLI, Web UI, and MCP. Web-triggered sweeps route cells through a bounded SlurmSSHExecutorPool against the configured SSH profile, while CLI and MCP runs use the local SLURM client by default. The Web UI surfaces per-cell progress with ETA, filter / sort, and per-cell cancellation.

Full workflow surface (validation, retries, partial execution, sweep recipes): Workflows how-to

Monitoring

# Monitor a job until completion
srunx watch jobs 12345

# Wait for GPUs, then submit
srunx watch resources --min-gpus 4
srunx sbatch --wrap "python train.py" --gpus-per-node 4

# Periodic cluster reports to Slack
srunx watch cluster --schedule 1h --notify $SLACK_WEBHOOK

Full monitoring options (continuous watch, thresholds, scheduled reports): Monitoring how-to

Remote SSH

Keep your local editor workflow while the jobs actually run on the cluster. Configure a profile once, and every srunx command accepts --profile <name> with the same syntax as local:

# One-time setup
srunx ssh profile add dgx --ssh-host dgx1
srunx ssh profile mount add dgx ml-exp \
  --local ~/projects/ml-exp --remote /home/user/ml-exp

# Same verbs you already use — now against the remote cluster
srunx sbatch train.sh --profile dgx                   # auto-rsyncs the mount + sbatch runs in-place on the remote path
srunx squeue --profile dgx                            # live queue on the remote cluster
srunx tail 847291 --profile dgx --follow              # stream remote logs
srunx flow run pipeline.yaml --profile dgx            # full DAG: sync once, hold the per-mount lock, sub

…