HPC Usage

👍Rule of Thumb

  • Do NOT run compute-heavy jobs on login nodes.

  • Run a small, test-scale job first before submitting large batch jobs.

  • Be respectful of others' computing resources; if you have thousands of small jobs, use a Job Array (a minimal sketch follows this list).
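
For example, on the Slurm-managed HPE cluster, a Job Array lets the scheduler handle thousands of similar tasks as a single submission. Below is a minimal sketch: the defq partition name is taken from the sbatch template later on this page, while process_sample.py and the 1-100 range are hypothetical placeholders.

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --partition=defq
#SBATCH --mem 4G
#SBATCH --time 0-4:00:00
#SBATCH --array=1-100

# Slurm sets SLURM_ARRAY_TASK_ID to a different value (1-100) in each array task
python process_sample.py --sample-id "${SLURM_ARRAY_TASK_ID}"

Submitting this once with sbatch schedules all 100 tasks for you.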

🏁Quick Starts

Cedars Tutorial: http://esplscsmgt01.csmc.edu/hpc.html

Department of Computational Biomedicine's guide: https://cedars.app.box.com/s/wxdznmrexzb0p3ocgiyrniqs8stqvx0c [Excellent resource; updated periodically]

🔥Basic Concepts

As of November 2022, there are two HPCs at Cedars-Sinai: the older cluster, built in 2013 and named Cisco, and the newer cluster, built in 2022 and named HPE.

The Cisco cluster has decent CPU power and RAM. Consider using it for typical bioinformatics workflows such as STAR alignment, RNA-seq processing, etc.

HPE is the newer cluster built in 2022, managed by Slurm, and equipped with cutting-edge GPUs (2x A100 belonging to our group, plus 8x A100 and 4x V100 as shared resources).

Read below to learn how to best navigate these two clusters.

Nodes

A typical HPC is an aggregation of several computers called nodes. Each node has its own CPU and RAM, but storage is shared across all nodes on the same cluster (i.e., regardless of which node you are logged into, all of your directories will be the same). Groups of nodes are dedicated to particular use cases. There are three primary types of nodes on the Cedars HPC. Only the Transfer Nodes have high-speed internet access, so be sure to use them for transferring large data (see the rsync sketch after the node list below).

  • Submit Node: Logging in, submitting jobs, requesting interactive nodes; no heavy compute.

    • on Cisco: csclprd3-s00[1,2,3]v.csmc.edu

    • on HPE: esplhpccompbio-lv0[1,3].csmc.edu

  • Transfer Node: Downloading and transferring large files.

    • on HPE: hpc-transfer01.csmc.edu

  • Compute Node: Workhorse for execution of programs and data analysis.

    • on Cisco: csclprd3-c[XXX]

    • on HPE: esplhpc-cp[XX]
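
For example, a large download or copy should go through the Transfer Node rather than a Submit Node. A minimal rsync sketch (MyUserName and the directories are hypothetical placeholders; replace them with your own):

# run from your local machine; -avP = archive mode, verbose, show progress and resume partial transfers
rsync -avP ./local_data/ MyUserName@hpc-transfer01.csmc.edu:/common/zhangz2lab/MyUserName/data/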

Storage

The storage on the HPC is Network Attached Storage, or NAS, which is shared (mounted/attached to each node) across all nodes on the same cluster. The only storage shared across the two clusters (2013 Cisco and 2022 HPE) is the /common folder (every user has a shortcut named ~/common that points to their /common/ folder).

A lab-shared storage partition can be requested by PIs for their group members. Currently, the Zhang lab has 90TB of shared storage at /common/zhangz2lab/, which is also shared between the two clusters. For convenience, you can symlink it into your home directory with ln -s /common/zhangz2lab $HOME/zhanglab.

Tip: since /common/ is shared between the two clusters, you can install a local Anaconda in this folder and have your .bashrc point to it, so that you have the same working environment on both HPCs.
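
A minimal sketch of that setup, assuming Anaconda was installed to /common/zhangz2lab/MyUserName/anaconda3 (a hypothetical path; adjust it to wherever you actually installed it):

# in $HOME/.bashrc (identical on both clusters, since /common is shared)
export PATH="/common/zhangz2lab/MyUserName/anaconda3/bin:$PATH"
# alternatively, keep the ">>> conda initialize >>>" block written by "conda init bash",
# which points at the same installation under /common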

A fast scratch storage of 2TB per user is available on the 2013 Cisco cluster at /scratch/username. It has faster I/O and is therefore well suited to short-turnaround experiments. However, note that any files older than 7 days in scratch are automatically deleted!!
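
A typical pattern is to stage inputs into scratch, run the job there to benefit from the faster I/O, and copy anything worth keeping back to /common before the 7-day purge. A minimal sketch (the file names are hypothetical placeholders):

cp /common/zhangz2lab/MyUserName/input.fastq /scratch/MyUserName/
cd /scratch/MyUserName/
# ... run the analysis here ...
cp results.txt /common/zhangz2lab/MyUserName/   # anything older than 7 days in scratch is deleted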

Access & Workflow

Generally, SSH (see the SSH section below) is used to log in to a Submit Node. SSH is available by default in Linux/macOS terminals, and on Windows via MobaXterm or PuTTY.

Once you are on a Submit Node, you can submit jobs and request resource allocations through the job management system (see the Job Management section following SSH).

📺SSH

Connecting to old Cisco CSMC cluster (2013)

As described above, the Cisco cluster is the one to use for typical CPU-bound bioinformatics workflows such as STAR alignment and RNA-seq processing.

Add the following lines to your $HOME/.ssh/config

Host csmc
   # You can also use s-00[123]v
   Hostname csclprd3-s002v.csmc.edu 

   # !!!CHANGE THIS!!! to your cluster username
   User MyUserName

   ForwardX11 yes
   # if you trust us (like -Y)
   ForwardX11Trusted yes

   # allows you to connect from other windows without re-authenticating
   ControlPath ~/.ssh/.%r@%h:%p
   ControlMaster auto

   # nudge the server every 100 seconds to keep the connection up
   ServerAliveInterval 100

Then, to connect, run ssh csmc in your terminal. You need to be on the Cedars intranet or VPN.
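
The csmc host alias also works with the other standard OpenSSH tools, so you can reuse it for key setup and small file copies. A minimal sketch (myscript.sh is a placeholder; use the Transfer Node for large data):

# optional: install your public key on the cluster so you are not prompted for a password every time
ssh-copy-id csmc
# copy a small file into your cluster home directory
scp myscript.sh csmc:~/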

Connecting to the new HPE cluster (2022)

HPE is the newer cluster built in 2022, managed by Slurm, and equipped with cutting-edge GPUs (see Basic Concepts above for the GPU inventory). However, it can get crowded when many jobs are submitted at once.

As before, add the following lines to your $HOME/.ssh/config

Host csmc-hpe
   # You can also use lv0[123]
   Hostname esplhpccompbio-lv02.csmc.edu 

   # !!!CHANGE THIS!!! to your cluster username
   User MyUserName

   ForwardX11 yes
   # if you trust us (like -Y)
   ForwardX11Trusted yes

   # allows you to connect from other windows without re-authenticating
   ControlPath ~/.ssh/.%r@%h:%p
   ControlMaster auto

   # nudge the server every 100 seconds to keep the connection up
   ServerAliveInterval 100
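
Then, as with Cisco, connect with ssh csmc-hpe from your terminal (you still need to be on the Cedars intranet or VPN).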

📚Job Management (SGE & Slurm)

In general, you can search online for the specific usage of commands such as qsub, qrsh, and qstat; these resources apply to most Sun Grid Engine (SGE) and similarly managed job systems. Below are a few commands for day-to-day use:

  • Getting an interactive node

    • On HPE, put these in your .bash_aliases: alias salloc-gpu="salloc --gpus=v100:1 --time=1-0 --mem=8g" and alias salloc-cpu="salloc -c 8 --time=1-0 --mem=8g"

    • On Cisco: alias qrsh-cpu="qrsh -l h_rt=24:00:00,h_mem=8g"

  • Submit a CPU/GPU job

    • On HPE, use the following CPU template (a GPU variant is sketched after it)

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --partition=defq
#SBATCH --mem 8G
#SBATCH --time 1-0:00:00

python your_script.py
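
For a GPU job, the same template can request a GPU with --gpus, mirroring the salloc-gpu alias above. A minimal sketch: the v100 GPU type comes from that alias, your_gpu_script.py is a placeholder, and the GPU nodes may sit in a partition other than defq, so check sinfo if the job stays pending.

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --partition=defq
#SBATCH --gpus=v100:1
#SBATCH --mem 16G
#SBATCH --time 1-0:00:00

python your_gpu_script.py

Submit either script with sbatch and monitor the queue with squeue -u $USER.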

For Cedars HPC documentation, see below:

http://esplscsmgt01.csmc.edu/hpc.html (you need to be on Cedars intranet)

  1. Setting up VSCode to work with a remote server: here (see also the note after this list)

  2. Requesting more lab-wide network-attached storage at Cedars HPC [only needed if our current storage space is not enough and/or you have large files incoming; ask Frank if you are not sure]: https://csmc.service-now.com/cssp?id=sc_cat_item&sys_id=d6f51aa54fc27e80ad486cd18110c75a
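
Note on item 1: VSCode's Remote-SSH extension reads the same $HOME/.ssh/config used in the SSH section above, so the csmc and csmc-hpe hosts defined there will appear directly in its remote host list.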
