Computing clusters: Local Guide
We have two separate clusters:
the old one, a324-2.vscht.cz (static IP 147.33.103.153), and
the new one, as67-1.vscht.cz (dynamic IP, currently 147.33.79.103).
User home directory is /home/USER, user data directory is /data/USER.
On nodes (clients), a user can use scratch disk space in directory
/scratch/USER/, where USER is user name.
Scratch disks are cleared of older files at irregular intervals.
Each clean-up is announced in advance.
Configuration as of May 2009:
a324-2.vscht.cz (147.33.103.153) contains two filesystems:
- /home = 246 GB of user space, accessible (via NFS) from all clients, periodically backed up
- /data = 158 GB of space, not available on clients and not backed up
| computer | processor | cores | mem [GiB] | scratch [kiB] | normal queue | nice queue |
| a00 = a324-2 | Athlon | 2 | 4 | - | - | - |
| a03 | Athlon | 1 | 1 | 71 259 640 | aqa-1-1 | nqa-1-1 |
| a04 | Athlon | 1 | 1 | 71 259 640 | aqa-1-1 | nqa-1-1 |
| a08 | Athlon | 2 | 4 | 184 401 136 | aqa-2-4 | nqa-2-4 |
| a09 | Athlon | 2 | 4 | 184 401 136 | aqa-2-4 | nqa-2-4 |
| a10 | Athlon | 2 | 4 | 184 401 136 | aqa-2-4 | nqa-2-4 |
| a11 | Athlon | 2 | 4 | 184 401 136 | aqa-2-4 | nqa-2-4 |
| a12 | Athlon | 2 | 4 | 184 401 136 | aqa-2-4 | nqa-2-4 |
| a13 | Athlon | 2 | 4 | 184 401 136 | aqa-2-4 | nqa-2-4 |
| a14 | Athlon | 2 | 4 | 184 401 136 | aqa-2-4 | nqa-2-4 |
| a15 | Athlon | 2 | 4 | 184 401 136 | aqa-2-4 | nqa-2-4 |
| a16 | Athlon | 2 | 4 | 184 401 136 | aqa-2-4 | nqa-2-4 |
| a20 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a21 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a22 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a23 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a24 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a25 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a26 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a27 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a28 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a29 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a30 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a31 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a32 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a33 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
| a34 | Athlon | 2 | 2 | 107 486 652 | aqa-2-2 | nqa-2-2 |
System administrator: Jiri Kolafa
as67-1.vscht.cz contains one filesystem:
- /home = 4.5 TiB of user space, accessible (via NFS) from all clients.
No backups are performed!
| computer | processor | cores | mem [GiB] | scratch [kiB] | normal queue | nice queue |
| s01 | Opteron | 8 | 16 | 544 089 632 | sq-8-16 | mq-8-16 |
| s02 | Opteron | 8 | 16 | 544 089 632 | sq-8-16 | mq-8-16 |
| s03 | Opteron | 8 | 16 | 544 089 632 | sq-8-16 | mq-8-16 |
| s04 | Opteron | 8 | 16 | 544 089 632 | sq-8-16 | mq-8-16 |
| s05 | Opteron | 8 | 16 | 544 089 632 | sq-8-16 | mq-8-16 |
| s48 | Athlon | 2 | 4 | 184 401 136 | sq-2-4 | mq-2-4 |
| s49 | Athlon | 2 | 4 | 184 401 136 | sq-2-4 | mq-2-4 |
| s50 | Athlon | 2 | 4 | 184 401 136 | sq-2-4 | mq-2-4 |
| s51 | Athlon | 2 | 4 | (problem) | sq-2-4 | mq-2-4 |
| s52 | Athlon | 2 | 4 | 184 417 200 | sq-2-4 | mq-2-4 |
| s53 | Athlon | 2 | 4 | 184 417 200 | sq-2-4 | mq-2-4 |
| s54 | Athlon | 2 | 4 | 232 473 724 | sq-2-4 | mq-2-4 |
| s55 | Athlon | 2 | 4 | 232 473 724 | sq-2-4 | mq-2-4 |
| s56 | Athlon | 2 | 4 | 232 473 724 | sq-2-4 | mq-2-4 |
| s57 | Athlon | 2 | 4 | 232 473 724 | sq-2-4 | mq-2-4 |
| s58 | Athlon | 2 | 4 | 232 473 724 | sq-2-4 | mq-2-4 |
| s59 | Athlon | 2 | 4 | 232 473 724 | sq-2-4 | mq-2-4 |
| s60 | Opteron | 4 | 8 | 368 836 168 | aqo-4-8 | nqo-4-8 |
| s61 | Opteron | 4 | 8 | 368 836 168 | aqo-4-8 | nqo-4-8 |
| s62 | Opteron | 4 | 8 | 368 836 168 | aqo-4-8 | nqo-4-8 |
System administrator: Dr. Polach (jiri.polach(at)marge.uochb.cas.cz)
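The scratch sizes in the tables above are given in kiB, with spaces as thousands separators. As a rough reading aid, the following sketch converts such a value to GiB. The function name kib_to_gib is hypothetical and not part of the cluster software.

```shell
# kib_to_gib VALUE: convert a size in kiB to GiB (1 GiB = 1024 * 1024 kiB).
# The value may be typed as in the table, e.g. 184 401 136.
kib_to_gib() {
    # Remove the spaces used as thousands separators, then divide by 1024^2.
    echo "$*" | tr -d ' ' | awk '{ printf "%.1f\n", $1 / 1048576 }'
}

kib_to_gib 184 401 136
```

The same call works for any scratch value in either table.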
User disk space (/home) is limited by quotas.
To check your quota status, use command:
quota -s
It is a good idea to put this command into your .login (if you use
tcsh) or .profile (if you use
bash).
The quota value reported can be exceeded up to the limit, but not for more than the grace period of 7 days.
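A minimal sketch of such start-up lines for bash users, assuming you want the report only in interactive sessions so that non-interactive logins print nothing. The helper name is_interactive is made up for this example; tcsh users would need an equivalent in .login.

```shell
# Lines for ~/.profile (bash).
# is_interactive: succeed only if the current shell is interactive.
is_interactive() {
    case $- in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Print the quota report only when a person is at the terminal
# and the quota command is actually installed.
if is_interactive && command -v quota >/dev/null 2>&1; then
    quota -s
fi
```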
This manual: http://www.vscht.cz/fch/en/research/cluster.html
Access to the clusters is possible only via Secure Shell (ssh),
and direct access (including X11 forwarding) is possible only from computers inside the VSCHT
domain. A user logs in to the server a324-2.vscht.cz (147.33.103.153):
ssh USER@a324-2.vscht.cz
(answer yes for adding a324-2.vscht.cz to your list of
trusted hosts)
There are several implementations of ssh for Windows. We recommend
Putty.
Start Putty, enter a324-2.vscht.cz as the host and
select SSH as the service (the TCP port should be 22).
One option is scp:
to copy file /home/USER/MYFILE from the cluster to your local current directory (./):
scp USER@a324-2.vscht.cz:MYFILE .
to copy your local file ./MYFILE to /home/USER/MYFILE on the
cluster:
scp MYFILE USER@a324-2.vscht.cz:MYFILE
Another option is sftp.
One option is WinSCP, which provides a
Windows Commander-like or Explorer-like interface for transferring files.
Start WinSCP.exe, enter the host a324-2.vscht.cz,
then your name and the password. The TCP port should be 22.
To use graphical applications like gnuplot, xxgdb, etc., on the cluster, you
need an X11 server running on your machine.
Note: People are often confused by the
client/server model of X11. A client (running on a remote machine, in
our case a324-2.vscht.cz) asks the server (running, e.g., on your M$
Windoze PC) to display graphics (e.g., to draw a rectangle).
Normally an X11 server is running and ssh with option -X
provides transparent X forwarding:
ssh -X USER@a324-2.vscht.cz
In case of problems: You may need to add the client
computer to the list of allowed hosts on your computer. Thus, on your
computer, run:
xhost +a324-2.vscht.cz
Sometimes it may also be necessary to set DISPLAY on a324-2.vscht.cz
(see below).
The recommended X server is XMing.
To establish a connection, you must
- Enable X forwarding in your Putty session by selecting
Connection → Tunneling → X11 → Enable X11 forwarding
- Set Disable server access control while starting XMing.
A running XMing is indicated by a small X in the right side of your task bar.
If anything goes wrong, try setting the DISPLAY environment variable in your
Putty shell:
setenv DISPLAY NAME:0.0 # in csh, tcsh
export DISPLAY=NAME:0.0 # in sh, bash
where NAME is the name of your computer, e.g.,
mycomp.vscht.cz or 147.33.103.16.
Note: An X11 session cannot be
started automatically from Windows because rexec and rsh are disabled
on the server (for safety reasons).
The simplest option is to use the SSH gateway: open any ssh connection to
ftpin.vscht.cz and log in as sshgw
(mnemonic: SSH GateWay) with password sshgw. Then type the name of
the target computer (a324-2), your user ID, and your password.
X11 forwarding is not supported; for connections including graphics, see below.
ssh sshgw@ftpin.vscht.cz
sshgw@ftpin.vscht.cz's password: sshgw
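Zadejte adresu systemu ke kteremu se chcete pripojit (enter the address of the system to connect to)
>a324-2
Zadejte jmeno uzivatele, pod kterym se chcete pripojit (enter the user name to connect as)
>USER
Probiha pripojovani... (connecting...)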
USER@a324-2's password: PASSWORD
Use Putty as above with host=ftpin.vscht.cz,
user=sshgw, and password=sshgw. Then answer the gateway prompts as described above.
One option is to use the ftp server ftpin.vscht.cz, otherwise see below. You need a (temporary) account at ftpin.vscht.cz. From
a shell at the cluster, run
telnet ftpin
login: ftpman
Password: ftpman
Zadejte nove uzivatelske jmeno: USER (enter a new user name)
Zadejte uzivatelske heslo:PASSWORD (enter a password)
Zadejte uzivatelske heslo znovu(kontrola):PASSWORD (once more to check)
Kolik dni chcete ponechat ucet aktivni(maximum 7)?[1] DAYS_ACTIVE (number of days the account stays active, max 7)
Zmacknete ENTER pro zalozeni uzivatele nebo ukoncete spojeni. ENTER (press ENTER to create the user, or close the connection)
Then you can access USER@ftpin.vscht.cz from both the VSCHT domain and
outside by your favorite ftp client.
Hint: put your login data into file .netrc, both on the cluster and
on your Linux machine, e.g.
machine ftpin.vscht.cz
login USER
password PASSWORD
macdef init
binary
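The fragment above can also be written by a small shell function. This is only a sketch: write_netrc is a hypothetical helper, and it assumes (as common ftp clients do) that a .netrc containing a password must be readable by its owner only and that the macdef block must be terminated by an empty line.

```shell
# write_netrc FILE USER PASSWORD
# Writes the ftpin.vscht.cz entry shown above to FILE with owner-only permissions.
write_netrc() {
    file=$1
    user=$2
    password=$3
    (
        # Create the file without group/other permissions from the start.
        umask 077
        # The trailing empty line terminates the "macdef init" macro.
        printf 'machine ftpin.vscht.cz\nlogin %s\npassword %s\nmacdef init\nbinary\n\n' \
            "$user" "$password" > "$file"
    )
    # Also tighten permissions if FILE already existed with a looser mode.
    chmod 600 "$file"
}

# Hypothetical usage (overwrites an existing ~/.netrc):
# write_netrc "$HOME/.netrc" USER PASSWORD
```

Note that the function replaces the whole file, so entries for other machines would have to be added again afterwards.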
As soon as you get a VPN session to VSCHT established, you may connect
(incl. X-forwarding) and send files directly from your remote PC. You need
to install a VPN client, though, and this approach to some extent
compromises security of your PC. For more info consult the
official manual.
Before first use on your home Linux computer,
get snx_install.sh and
install it by running sh snx_install.sh as root.
Unfortunately, snx requires an old version of library
(libstdc++2.10-glibc2.2). If this is not your default library, install it
and do the following hack (as root):
cd /usr/bin
mv snx snx-bin
echo LD_PRELOAD\=libstdc++-libc6.2-2.so.3 /usr/bin/snx-bin \"\$\@\" > snx
chmod u+rsx snx
chmod go+x snx
Connect to VPN by command (as root):
snx -s 147.33.1.3 -u VSCHTUSER
and enter your VSCHT password when requested. (VSCHTUSER is your short login
name to VSCHT domain.)
Now you should receive an "Office Mode IP".
The new interface is called tunsnx (check this with ifconfig).
To disconnect the VPN session, run as root:
snx -d
Alternatively, you can use a browser-based method (Java RTE is needed),
similar to the Windows procedure (see below) [not tested].
As soon as you are connected by VPN to VSCHT, the usage is
the same as if you were in the domain
(ssh -X USER@a324-2.vscht.cz).
You can access your home computer via
the OFFICE_MODE_IP from computers in the VSCHT domain; e.g.,
you can transfer files
from a computer (of "Kategorie 2") to home like
scp REMOTEFILE LOCALUSER@OFFICE_MODE_IP:LOCALFILE
scp LOCALUSER@OFFICE_MODE_IP:LOCALFILE REMOTEFILE
(At this moment, cluster is not "Kategorie 2" so that this direction does
not work.)
- In a browser, open https://147.33.1.3
- Check that your browser allows pop-up windows
- Accept the certificate and ignore messages of non-matching names
- If a pop-up window appears, log in as to the VSCHT network (use short
username, not full name with a dot, and your e-mail password)
- You should receive an IP address (field "Office Mode IP")
Details may differ according to your browser.
- If used for the first time, you will be asked to install snx.
Make sure you have administrator privileges!
- In addition, you may need administrator privileges to accept the snx
connection (although this can probably be set up so that you can eventually
connect as an ordinary user).
As soon as you are connected by VPN to VSCHT, the usage is
the same as if you were in the domain (Putty and
XMing).
A user is normally logged in to the server (a324-2.vscht.cz), where
(s)he can manage and edit
files, compile and debug programs, submit jobs to be run on the client nodes, and
analyse the results (incl. X11 graphics).
No lengthy calculations are allowed directly on the server!
If necessary (e.g., lengthy interactive debugging), you may also use ssh to connect directly to machines
inside the cluster. It is not allowed to jump the queue of submitted
jobs in this way!
The most important commands to survive are:
| passwd | Change your password |
| man COMMAND | Get the manual page of COMMAND |
| xman | Manual pages browsing tool, requires X11 |
| info | Comprehensive manual of GNU software |
| info COMMAND | Info on COMMAND (often more up-to-date than the man-page) |
Your shell is tcsh or bash. (To figure out which one,
execute command ps.) To get help, use
man tcsh or
man bash
It's pretty long, isn't it? A few basic commands, common for both shells, are listed below.
| ls or ls DIRECTORY/ | List files |
| ls -l or ls -l DIRECTORY/ | List files with verbose info |
| less FILE | View a text file; use arrows, PgUp/PgDn or u/space, quit by q |
| cp -i FILE1 FILE2 or cp FILE(s) DIRECTORY/ | Copy files. Asks for confirmation if the destination file is to be overwritten |
| mv -i FILE1 FILE2 or mv FILE(s) DIRECTORY/ | Rename or move files. Asks for confirmation if the destination file is to be overwritten |
| rm -i FILE(s) | Remove files. Asks for confirmation |
(Option -i in the above commands ensures confirmation if a file
is about to be overwritten or erased; based on your environment, you may
have aliased the above commands so that -i does not have to be
used.)
Another possibility is to replace this shell by the Midnight
Commander, a clone of the popular Norton Commander. It is started by:
mc
| mc [F4] | Internal editor of the Midnight Commander, invoked by pressing [F4] |
| joe FILE | Simple text editor of the WordStar/Turbo family |
| emacs FILE | Powerful but complicated text editor |
| vi FILE | UNIX classical text editor (incomprehensible for Windows users) |
Another possibility is to edit files locally (on your Windows) and to
move them using WinSCP.
A user batch job (binary executable or a script) is submitted on the
server to a queue. As soon as there are resources available, the job
is started on a client.
There are two instances of queues:
- Normal queue. The tasks are started with normal (maximum)
priority (nice 0). The limit is about 10 jobs/user. Extra jobs will wait in a
queue. Basic commands to control jobs are qsub,
qdel, qstat.
- Nice queue. The tasks are started with minimum priority
(nice -19). The limit is about 30 jobs/user. Extra jobs will wait in
a queue. Basic commands to control jobs are qsubsec,
qdelsec, qstatsec; any command for the nice
queue can be also obtained by prefix sgesec, e.g., the
following two commands are equivalent:
sgesec qsub nicescript.sh
qsubsec nicescript.sh
In addition, there is the local rule:
- A user with more nice jobs than the maximum is not allowed to submit more than 5 normal jobs.
- A user with more normal jobs than the maximum is not allowed to submit more than 10 nice jobs.
Examples:
-
Submit program a.out on a324-2 with moderate memory and time requirements:
qsub -cwd -b y -q "aqa-2-?" a.out
- a.out must be a binary executable (option -b y). It is started from the current directory (option -cwd)
- Output is in files a.out.o# (stdout) and a.out.e# (stderr), where # denotes job number
- The job may be started on either aqa-2-2 or aqa-2-4
-
Submit program a.out on a324-2 which reads file fort.11 and
writes file fort.22; however, your input file is
myin.dat and output should be myout.dat. All these
files are in the current directory (option -cwd).
The running program needs 3 GiB of memory (4 in option -q
"aq?-?-4"). The standard output (stdout) and error output
(stderr) are concatenated (option -j y). The name of the job
should be MYJOB.
- Create script myscript.sh:
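#!/bin/bash
#$ -cwd
#$ -q "aq?-?-4"
#$ -j y
# make the input file visible under the name the program reads (fort.11)
ln -s myin.dat fort.11
# run the program from the current directory
./a.out
# give the program's output file its final name
mv fort.22 myout.dat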
- then make it executable and submit it:
chmod +x myscript.sh
qsub -N MYJOB myscript.sh
- Submit a time-intensive calculation (script simul.sh) as a nice job
on any client of as67-1
qsubsec -q "mq-*-*" simul.sh
- A shortcut command to submit a job (binary executable or script) with
parameters, passing on as much of the current environment as possible, is, e.g.,
jsub a.out
Run jsub without arguments to get help.
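Jobs that do heavy file I/O can work in the scratch directory of the client (/scratch/USER/) instead of the NFS-mounted home directory. The following is a sketch only: run_in_scratch is a hypothetical helper, not a cluster command, and the file names in the usage comment are placeholders.

```shell
# run_in_scratch SCRATCH_ROOT INPUT OUTPUT COMMAND [ARGS...]
# Copies INPUT into a private directory under SCRATCH_ROOT, runs COMMAND there,
# copies OUTPUT back to the starting directory, and removes the private directory.
# Returns the exit status of COMMAND.
run_in_scratch() {
    scratch_root=$1
    input=$2
    output=$3
    shift 3
    start_dir=$PWD
    # A unique directory per job avoids clashes between jobs on the same client.
    work_dir=$(mktemp -d "$scratch_root/job.XXXXXX") || return 1
    cp "$input" "$work_dir/" || return 1
    # Run the command in a subshell so the caller's directory is unchanged.
    ( cd "$work_dir" && "$@" )
    status=$?
    # Copy the result back only if the command produced it.
    if [ -f "$work_dir/$output" ]; then
        cp "$work_dir/$output" "$start_dir/"
    fi
    rm -rf "$work_dir"
    return "$status"
}

# Hypothetical usage inside a job script submitted with -cwd
# (the program must be given by absolute path because it runs in scratch):
# run_in_scratch "/scratch/$USER" myin.dat myout.dat "$PWD/a.out"
```

Removing the private directory at the end keeps the scratch disk tidy between the announced clean-ups.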
- Edited listing for the current user:
jstat
- Basic listing commands:
qstat
qstatsec
- Detailed listing:
qstat -f
qstatsec -f
- Remove normal job waiting in a queue (use qstat or jstat to get JOB-ID):
qdel JOB-ID
- Remove nice job waiting in a queue (use qstatsec or
jstat to get JOB-ID):
qdelsec JOB-ID
- Kill (interrupt, signal) a running job
- Use qstat/qstatsec/jstat to determine on which client the job is running
- ssh to the client (e.g., a77) and determine the process ID:
ssh a77
ps x
- Signal the running process:
| kill -2 ID | # SIGINT, the same as Ctrl-C interactively; many applications finish (some) work and close files |
| kill ID | # try harder (signal SIGTERM) |
| kill -9 ID | # kill unconditionally (nothing saved) |
- Alternatively, run top and kill the process by hot key k.
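The escalation from SIGINT over SIGTERM to SIGKILL described above can be sketched as a shell function to be run on the client where the job is running. The name stop_job and the default grace period of 10 seconds are assumptions made for this example.

```shell
# stop_job PID [GRACE]
# Sends SIGINT, then SIGTERM, then SIGKILL, waiting up to GRACE seconds
# (default 10) after each signal for the process to disappear.
# Returns 0 as soon as the process can no longer be signalled, 1 otherwise.
stop_job() {
    pid=$1
    grace=${2:-10}
    for sig in INT TERM KILL; do
        # If the signal cannot be delivered, the process is gone (or not yours).
        kill -s "$sig" "$pid" 2>/dev/null || return 0
        i=0
        while [ "$i" -lt "$grace" ]; do
            sleep 1
            # kill -0 only tests whether the process still exists.
            kill -0 "$pid" 2>/dev/null || return 0
            i=$((i + 1))
        done
    done
    return 1
}

# Hypothetical usage, with ID taken from "ps x" on the client:
# stop_job ID 30
```

A longer grace period gives the application more time to finish its work and close files before the harsher signals are tried.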
The PGI optimizing and
parallelizing FORTRAN 77/90 and C/C++ compilers with a debugger have been
installed on the old cluster.
The manual is invoked by command
netscape $PGI/doc/index.htm
provided that you have an X11 server running, or in
the text mode (with much less user comfort) by
lynx $PGI/doc/index.htm
To compile and debug a simple (one-module) FORTRAN program for use on one processor, use:
| pgf77 -g myprog.f | Compile FORTRAN 77 program for debugging. Output executable is a.out |
| pgdbg a.out | Debug a.out |
| Xpgdbg a.out | Debug a.out using a graphical interface to gdb -- X11 is needed |
| pgf90 -O2 -o myprog myprog.f | Compile FORTRAN 90 program for final run (maximum optimization). Output executable is myprog |
Note: The PGI compiler usually gives faster code (by 20-30%) and is therefore recommended.
To compile and debug a simple (one-module) FORTRAN program, use:
| g77 -g myprog.f | Compile FORTRAN program for debugging. Output executable is a.out |
| gdb a.out | Debug a.out -- see below |
| xxgdb a.out | Debug a.out using a graphical interface to gdb -- X11 is needed |
| g77 -O3 -ffast-math -o myprog myprog.f | Compile FORTRAN program for final run (maximum optimization). Output executable is myprog |
For programs consisting of several modules, read the manual pages for g77 and make.
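Until a proper Makefile is written, a program of several modules can be built by compiling each source file to an object file and linking the objects. The following is a sketch only: the function name build and the variables FC and FFLAGS are conventions chosen for this example, with defaults taken from the g77 command line above.

```shell
# build OUTPUT SOURCE.f...
# Compiles each Fortran source to an object file, then links all objects.
# FC selects the compiler (default g77), FFLAGS the options.
build() {
    fc=${FC:-g77}
    fflags=${FFLAGS:--O3 -ffast-math}
    output=$1
    shift
    objects=
    for src in "$@"; do
        obj=${src%.f}.o
        # Recompile only if the object is missing or older than its source.
        if [ ! -f "$obj" ] || [ "$src" -nt "$obj" ]; then
            # $fc and $fflags are left unquoted so that several options split into words.
            $fc $fflags -c -o "$obj" "$src" || return 1
        fi
        objects="$objects $obj"
    done
    $fc $fflags -o "$output" $objects
}

# Hypothetical usage (file names are placeholders):
# build myprog main.f sub1.f sub2.f
```

Unlike make, this sketch does not track dependencies between modules, so it is suitable only for small programs.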
Call as
gdb PROGRAM
If your program has crashed and dumped the core (file core), you
can perform a post-mortem analysis by
gdb PROGRAM core
To debug an already started program, use
gdb PROGRAM PID
where PID is a process number that can be obtained by top or ps.
A few survival commands of gdb follow:
| Synopsis | abbr | Explanation |
| break FILE:LINE or break FUNCTION | b | Set breakpoint (to the beginning of line) |
| cont | c | Continue execution |
| del [NUMBER] | d | Delete breakpoint(s) |
| help [COMMAND] | h | Get help (on COMMAND) |
| list | l | List several lines above and below the current position in the source |
| next | n | Execute next line incl. all functions |
| print [EXPRESSION] | p | Print the value of variable or expression |
| run [ARGS] | r | Start program (w. command line ARGS) |
| step | s | Execute next line; if there is a function or procedure call, step into it. |
| tbreak FILE:LINE or tbreak FUNCTION | tb | Set temporary breakpoint (deleted after use) |
| watch [EXPRESSION] | wa | Breakpoint if expression changes. Very slow! |
| what [EXPRESSION] | wha | Type of EXPRESSION |
| where | whe | To see where in the program you are |
| Ctrl-C | | Interrupt running program |
Note: EXPRESSION is either one variable or a valid expression in
the syntax of the language being debugged.
Backups of a324-2 are performed every night on a325-11 ("the backup server")
using rsync. All user home directories (/home/*/) are
included. Four daily backups are kept; weekly backups (made in the night from Saturday to
Sunday) are kept as long as the backup server capacity allows. Older backups are deleted.
Currently only root can access a325-11; if you need to recover a damaged or lost file, please contact the system administrator.
There is no backup service for as67-1. Any removed file is lost.