HPC Usage
👍Rule of Thumb
Do NOT run compute-heavy jobs on login nodes.
Run a small, test-scale job first before submitting large batch jobs.
Be respectful of others' computing resources; if you have thousands of small jobs, use a Job Array.
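If you have not set up a job array before, here is a minimal Slurm sketch (Slurm is the scheduler on the HPE cluster described below; the script, file names, and resource numbers are placeholders — on the SGE-managed Cisco cluster the analogous mechanism is qsub -t 1-1000):

```bash
#!/bin/bash
#SBATCH --job-name=my_array       # placeholder job name
#SBATCH --array=1-1000%50         # 1000 tasks, at most 50 running at once
#SBATCH --cpus-per-task=1
#SBATCH --mem=4g
#SBATCH --time=0-2                # 2 hours per task

# Each task picks its own input line based on the array index.
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)   # inputs.txt: one input path per line (placeholder)
./process_one_sample.sh "$INPUT"                       # placeholder per-sample script
```

Submit it once with sbatch my_array.sh and the scheduler manages the thousands of tasks for you.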
🏁Quick Starts
HPC Quick Start Guide: https://csmc.service-now.com/cssp?id=kb_article_view&sys_kb_id=2661d54c1bdcca504546c80b234bcb6a
Research HPC Support Request: https://csmc.service-now.com/cssp?id=sc_cat_item&sys_id=223547741bca2d50670b2068b04bcb73
Cedars Tutorial: http://esplscsmgt01.csmc.edu/hpc.html
Department of Computational Biomedicine's guide: https://cedars.app.box.com/s/wxdznmrexzb0p3ocgiyrniqs8stqvx0c [Excellent resource; updated periodically]
🔥Basic Concepts
As of November 2022, there are two HPC clusters at Cedars-Sinai. The older cluster was built in 2013 and is named Cisco; the newer one was built in 2022 and is named HPE.
The Cisco cluster has decent CPU power and RAM. Consider using it for typical bioinformatics workflows such as STAR alignment, RNA-seq processing, etc.
HPE is a newer cluster built in 2022, managed by Slurm, and equipped with cutting-edge GPUs (2x A100 belonging to our group, plus 8x A100 and 4x V100 as shared resources).
Read below to learn how to best navigate these two clusters.
Nodes
A typical HPC is an aggregation of several computers called nodes. Each node has its own CPU and RAM, but storage is shared across all nodes on the same cluster (i.e., regardless of which node you are logged into, all of your directories will be the same). Groups of nodes are dedicated to particular use cases; there are three primary types of nodes on the Cedars HPC. Only the Transfer Nodes have high internet bandwidth, so be sure to use them when transferring large data.
Submit Node: Logging in, submitting jobs, requesting interactive nodes; no heavy compute.
on Cisco: csclprd3-s00[1,2,3]v.csmc.edu
on HPE: esplhpccompbio-lv0[1,3].csmc.edu
Transfer Node: Downloading and transferring large files (see the rsync example after this list).
on HPE: hpc-transfer01.csmc.edu
Compute Node: Workhorse for execution of programs and data analysis.
on Cisco: csclprd3-c[XXX]
on HPE: esplhpc-cp[XX]
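For example, a large download or upload can go through the HPE transfer node with rsync (a sketch; the paths and your_username are placeholders):

```bash
# run from your local machine (laptop/workstation), not from a login node
rsync -avP ./my_dataset/ your_username@hpc-transfer01.csmc.edu:/common/zhangz2lab/your_username/my_dataset/
```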
Storage
The storage on the HPC is Network Attached Storage, or NAS. It is shared (i.e., mounted/attached on each node) across all nodes of the same cluster. The only storage shared across the two different clusters (2013 Cisco and 2022 HPE) is the /common folder (all users have a shortcut named ~/common that points to their /common/ folder).
A lab-shared storage partition can be requested by PIs for their group members. Right now, the Zhang lab has 90 TB of shared storage at /common/zhangz2lab/, which is also shared between the two clusters. For convenience, you can symlink it into your home directory with ln -s /common/zhangz2lab $HOME/zhanglab.
Tip: since /common/ is shared between the two clusters, you can install your local Anaconda in this folder and have your .bashrc point to it, so that you have the same working environment on both HPCs.
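A minimal sketch of that setup, assuming a Miniconda installer and a hypothetical install location under the lab share (adjust the prefix to wherever you want your environments to live):

```bash
# install Miniconda into the shared /common space so both clusters see it
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /common/zhangz2lab/$USER/miniconda3

# add this line to ~/.bashrc on BOTH clusters so they share one environment tree
source /common/zhangz2lab/$USER/miniconda3/etc/profile.d/conda.sh
```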
A fast scratch storage of 2 TB per user is available on the 2013 Cisco cluster at /scratch/username. It has faster I/O and is therefore suitable for short-turnaround experiments. However, note that any files older than 7 days are automatically deleted from scratch!
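A typical pattern (a sketch; all paths and scripts below are placeholders) is to stage inputs into scratch, run the I/O-heavy step there, and copy the results back to /common before the 7-day purge:

```bash
# on the Cisco cluster
mkdir -p /scratch/$USER/run01
cp /common/zhangz2lab/$USER/inputs/*.fastq.gz /scratch/$USER/run01/   # stage inputs (placeholder files)
cd /scratch/$USER/run01 && ./run_analysis.sh                          # placeholder analysis script
cp -r /scratch/$USER/run01/results /common/zhangz2lab/$USER/          # copy results back before the purge
```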
Access & Workflow
Generally, SSH (see the SSH section below) is used to log in to a Submit Node. SSH is available by default in Linux/macOS terminals, and on Windows via MobaXterm or PuTTY.
Once you are on a Submit Node, you can submit jobs and request resource allocations using the job management system (see the Job Management section following SSH).
📺SSH
Connecting to the old Cisco CSMC cluster (2013)
Add the following lines to your $HOME/.ssh/config:
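A minimal sketch of such an entry, assuming the first Cisco submit node listed above (replace your_username with your Cedars ID; the Host alias csmc matches the ssh csmc command below):

```
Host csmc
    HostName csclprd3-s001v.csmc.edu
    User your_username
```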
Then, to connect, run ssh csmc in your terminal. You need to be on the Cedars intranet or VPN.
Connecting to the new HPE cluster (2022)
HPE is a newer cluster built in 2022, managed by Slurm, and equipped with cutting-edge GPUs (see Basic Concepts above for the GPU breakdown). However, it can get crowded at times when too many jobs are submitted at once.
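As with Cisco, you can add an entry for an HPE submit node to the same $HOME/.ssh/config (a sketch; the Host alias hpe is arbitrary and your_username is a placeholder):

```
Host hpe
    HostName esplhpccompbio-lv01.csmc.edu
    User your_username
```

Then connect with ssh hpe from the Cedars intranet or VPN.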
📚Job Management (SGE & Slurm)
In general, you can search online for the specific usage of commands such as qsub, qrsh, and qstat; those resources apply to most SGE (Sun Grid Engine) and similarly managed job systems. Below are a few commands for day-to-day usage:
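(In addition to the Cedars-specific items below, these generic scheduler commands are handy for monitoring; they are standard SGE/Slurm commands, not anything specific to our clusters.)

```bash
# SGE (Cisco)
qstat -u $USER        # list your running/pending jobs
qdel <job_id>         # cancel a job

# Slurm (HPE)
squeue -u $USER       # list your running/pending jobs
scancel <job_id>      # cancel a job
sacct -j <job_id>     # accounting info for a running/finished job
```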
Getting an interactive node
On HPE, put these in your .bash_aliases:
alias salloc-gpu="salloc --gpus=v100:1 --time=1-0 --mem=8g"
alias salloc-cpu="salloc -c 8 --time=1-0 --mem=8g"
On Cisco:
alias qrsh-cpu="qrsh -l h_rt=24:00:00,h_mem=8g"
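Once the aliases are loaded (open a new shell or source .bash_aliases), an interactive GPU session on HPE looks roughly like this (a sketch; depending on the Slurm configuration, salloc may drop you straight onto the compute node, or you may need the explicit srun step):

```bash
salloc-gpu            # request 1x V100, 8 GB RAM, 1 day (alias defined above)
srun --pty bash -i    # if salloc leaves you on the submit node, open a shell on the allocated node
```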
Submit a CPU/GPU job
On HPE, use the following template
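A minimal sketch of such a Slurm submission script (job name, resources, environment, and command are all placeholders to adapt to your workload):

```bash
#!/bin/bash
#SBATCH --job-name=my_job          # placeholder job name
#SBATCH --gpus=v100:1              # request one V100; drop this line for a CPU-only job
#SBATCH --cpus-per-task=8
#SBATCH --mem=32g
#SBATCH --time=1-0                 # 1 day
#SBATCH --output=slurm-%j.out      # stdout/stderr log

source ~/.bashrc
conda activate my_env              # placeholder conda environment
python my_script.py                # placeholder command
```

Submit with sbatch my_job.sh and check progress with squeue -u $USER.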
For Cedars HPC documentation, see below:
http://esplscsmgt01.csmc.edu/hpc.html (you need to be on Cedars intranet)
🔗Useful links
Setting up VSCode to work with Remote server: here
Requesting additional lab-wide network-attached storage at Cedars HPC [only needed if our current storage space is not enough and/or you have large files incoming; ask Frank if you are not sure]: https://csmc.service-now.com/cssp?id=sc_cat_item&sys_id=d6f51aa54fc27e80ad486cd18110c75a