SSH Directly into Slurm Job

Connect your editor or terminal to the resources you requested, as smoothly as any ordinary SSH host.

Command

Host <bastion>
  HostName     <bastion-host>
  User         <user>
  IdentityFile <identity-file>

Host <bastion>-debug
  ProxyCommand ssh -q <bastion> "bash -l -c 'nc \$(squeue --me --states=R --noheader --format=%%N --name="debug" | head -n 1) 22'"
  User         <user>
  IdentityFile <identity-file>

  StrictHostKeyChecking no
  UserKnownHostsFile    /dev/null
  LogLevel              ERROR

Add this configuration to your ~/.ssh/config, then SSH into your latest job named debug with ssh <bastion>-debug, or point Visual Studio Code Remote - SSH or Zed Remote at the host <bastion>-debug.

A benefit of this approach: you don't need to pin a fixed node or copy-paste whichever node your job happens to land on; every node running the debug job collapses into a single host. It also keeps your shell history and editor workspaces clean.
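Because <bastion>-debug behaves like any ordinary SSH host, the rest of your tooling works through it unchanged. A small sketch (the file and directory names here are placeholders):

  # copy a script onto the compute node behind the alias
  scp ./train.py <bastion>-debug:~/

  # or sync a whole directory over the same proxied connection
  rsync -av ./data/ <bastion>-debug:~/data/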

Explanation

Caution

This section is generated by an LLM and audited by the author.

ProxyCommand

This directive tells SSH how to reach the target: instead of opening a TCP connection itself, SSH runs the given command and uses its stdin/stdout as the transport. Here the command first connects to the <bastion> host, then locates the compute node running your job (you can sanity-check the inner pipeline by hand; see the sketch after this list):

  1. squeue: Queries the Slurm queue.
    • --me: Filters for jobs owned by the current user.
    • --states=R: Selects only running jobs (you can't SSH into a pending job).
    • --name="debug": Selects jobs named "debug" (adjust this to match your interactive session name, e.g., salloc -J debug ...).
    • --format=%N: Outputs only the nodelist. Note the doubled percent (%%N) in the config, which escapes % for ssh_config.
    • --noheader: Suppresses the header row.
  2. head -n 1: Takes the first line, i.e., the nodelist of the first matching job when multiple debug jobs are running. (A job spanning several nodes prints a compressed nodelist on one line; see Prerequisites.)
  3. nc <node> 22: Uses netcat to open a TCP connection to port 22 (SSH) on the identified compute node; SSH then runs the actual session over this tunnel.
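Before wiring this into ssh_config, you can sanity-check the inner pipeline directly on the bastion. Note the single %N here, since the doubling is only an ssh_config escape; the output shown is a placeholder:

  squeue --me --states=R --noheader --format=%N --name="debug" | head -n 1
  # expected: a single hostname, e.g. node01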

Host Key Checking

  StrictHostKeyChecking no
  UserKnownHostsFile    /dev/null
  LogLevel              ERROR

Since the compute node assigned to your job changes every time you start a new session, the host key will constantly change.

  • StrictHostKeyChecking no: Disables the safety check that blocks connection when a host key changes.
  • UserKnownHostsFile /dev/null: Prevents saving the host key to your known_hosts file, avoiding "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!" errors in the future.
  • LogLevel ERROR: Suppresses the warning messages about the host key being added to the list of known hosts (which is /dev/null anyway).
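These three options are scoped to the <bastion>-debug entry only, so host key verification for every other host in your config remains intact. A quick end-to-end check:

  # prints the compute node's hostname if the whole proxy chain works
  ssh <bastion>-debug hostname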

Prerequisites

Ensure you have an interactive job running under the matching name (debug in this case) before connecting. The job should also occupy exactly one node; otherwise squeue may print a compressed nodelist such as node[01-02], which is not a valid hostname.

salloc --job-name=debug --time=1:00:00 --nodes=1 ...

This also assumes your cluster uses node names as resolvable hostnames and that squeue prints a plain hostname. If it prints a compressed nodelist instead, you may need to expand it with scontrol show hostnames first, as sketched below.
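A minimal sketch of that variant, keeping every other option from the config above unchanged; it expands the nodelist with scontrol show hostnames and then picks the first node (the nested \$(...) escaping mirrors the original ProxyCommand):

  Host <bastion>-debug
    ProxyCommand ssh -q <bastion> "bash -l -c 'nc \$(scontrol show hostnames \$(squeue --me --states=R --noheader --format=%%N --name=debug | head -n 1) | head -n 1) 22'"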