The E10K at NAVO has 400 MHz UltraSPARC (v.9) CPUs. Each CPU has a 16 KB
Level-1 data cache and a 4 MB Level-2 cache. The UltraSPARC-II is a pipelined
superscalar architecture which can issue 2 floating-point instructions
per clock cycle. Thus the theoretical peak performance is 2 times the
clock rate (i.e. up to 0.8 GFLOPS for the 400 MHz chips).
The Sun E10K uses a proprietary crossbar (called the
Gigaplane-XB) to connect CPUs to memory. This
interconnect has a bandwidth of 12.8 GB/sec.
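The peak figure quoted above is simple arithmetic, and a short sketch makes it easy to redo for other clock rates. The function name peak_gflops is made up for this illustration, and the 64-CPU total is only an upper bound that real codes will not reach.

```shell
#!/bin/sh
# Theoretical peak = clock rate (MHz) x floating-point instructions per cycle
# x number of CPUs, converted from MFLOPS to GFLOPS.
# peak_gflops is a hypothetical helper name, not a Sun tool.
peak_gflops() {
    # $1 = clock in MHz, $2 = FP instructions per cycle, $3 = CPU count (default 1)
    awk -v mhz="$1" -v ipc="$2" -v n="${3:-1}" \
        'BEGIN { printf "%.1f\n", mhz * ipc * n / 1000 }'
}

peak_gflops 400 2       # one 400 MHz CPU
peak_gflops 400 2 64    # all 64 CPUs of the machine
```

The two calls print 0.8 and 51.2 (GFLOPS) respectively.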
REALLY REALLY Quick Introduction
.cshrc:
Typical compiler optimization switches are:
mpcc -fast -xprefetch -o foo foo.c -lmpi
The -fast option is a shortcut for -xO4 -xtarget=native -fns
-fsimple=1 -dalign -ftrap=%none -xlibmil -fsingle. Note that some of
these options turn off certain IEEE floating point exceptions.
The -xO4 option tells the compiler to generate optimized code (level
5 is the highest setting). Note that -xO4 has the potential to alter the
semantics of your code and thus may cause errors at run time.
The -xprefetch option tells the compiler to generate code that prefetches
memory references (if the architecture supports it, i.e. UltraSPARC-II and
higher).
The -fsingle option can be extremely useful if your code does a lot
of single-precision floating point work. By default, the single-precision
variables will be upgraded to double-precision, the work will be performed
in double-precision, then downgraded to a single-precision result. By
selecting -fsingle, all work will be done in single-precision, and
the conversion steps are avoided.
For Fortran 90 programs, you need to specify -fixed for fixed-format
Fortran lines or -free for free-form.
Note that there have been significant changes between the V.8, V.8plus,
V.9, etc. revisions of the SPARC architecture. You should therefore make sure
that you target the architecture that you will be running on (or
use "native" if you will only be using one type of SPARC system).
Use of -xtarget=native will also ensure that the compiler uses
the proper values to optimize for the chip's cache sizes.
To create 64-bit code, use "-xarch=v9" or "-xarch=v9a" on
the compile line, but make sure you put this after the -xtarget
option. Thus, you can compile 32-bit code by setting
-xtarget=native; to compile 64-bit code you need
-xtarget=native -xarch=v9. The mpcc compiler will
automatically link in the proper MPI library.
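As a rough sketch of the ordering rule above, the following helper only prints the compile line for a 32- or 64-bit build; it does not run the compiler. The name build_cmd is hypothetical, the flags are the ones already mentioned on this page, and the explicit -lmpi follows the earlier example.

```shell
#!/bin/sh
# Print (do not execute) an mpcc command line for a 32- or 64-bit build.
# build_cmd is a made-up helper name used only for illustration.
build_cmd() {
    bits=$1                 # 32 or 64
    src=$2                  # C source file, e.g. foo.c
    exe=${src%.c}           # executable name: source name without ".c"
    flags="-xtarget=native"
    # -xarch has to come after -xtarget, so it is appended last.
    if [ "$bits" = 64 ]; then
        flags="$flags -xarch=v9"
    fi
    echo "mpcc $flags -o $exe $src -lmpi"
}

build_cmd 32 foo.c
build_cmd 64 foo.c
```

The second call prints mpcc -xtarget=native -xarch=v9 -o foo foo.c -lmpi; the first prints the same line without -xarch=v9.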
For OpenMP codes, you'll need to turn on SMP support in the compiler
with the -xautopar -xparallel -xexplicitpar options. According
to the man pages, you should avoid using -xparallel if you do
your own thread management.
If you use both OpenMP and MPI in the same code, you will need to link
with the multi-threaded version of MPI by using -lmpi_mt.
Sun also provides a vector math library for evaluating common mathematical
functions on entire arrays of arguments. To use these libraries, link
with libmvec or libmvec_mt for multi-threaded codes. More
information on the specific routines can be found on the libmvec
man page.
For 32- and 128-bit datatypes, you should link with the libsunmath
library for best performance. For 64-bit datatypes (or to have 32-bit
values automatically promoted to 64-bit), just use the standard libm
or libmopt libraries.
Finally, Sun provides the Scalable Scientific Subroutine Library (S3L) for
parallel codes. Link with -ls3l to include these routines.
Details on how to use the S3L library are beyond the scope of this page;
see the S3L User's Guide.
To submit jobs, use the bsub program. Much like NQS and
Codine, you can submit jobs directly from the command line with bsub,
but it is much easier to create a script to run your job, then submit the
script. As with NQS and Codine, LSF allows you to embed queuing options
into your scripts. Where NQS uses #QSUB and Codine uses
#$, LSF uses #BSUB. In most cases, there is a 1-to-1
replacement of NQS/Codine keywords with LSF keywords. Here's a
good starting point:
Use bqueues to find out about what queues are available (or
bqueues -l for more detailed information).
ps can provide more detailed information by using the -o
(output format) option. For example:
From the man page, this command will output (in order)
the real user ID, the command name, the scheduling class, total virtual memory
size (KB), elapsed time since the start of the program, cumulative CPU time,
the LWP ID, and the state of the process. The process state is one of "O" (the
process is running), "S" (sleeping), "R" (on the run queue), "Z" (zombie),
or "T" (stopped). The output will probably scroll off the screen, but it
is easily parsed by awk, grep, or perl scripts.
(This info was lifted from the psmap
shell script from Alan Wallcraft at NRL)
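Since the listing is meant to be parsed, here is a minimal awk sketch of one such filter. It keeps only the processes that are on a CPU (state "O") and prints their owner, command, and CPU time. The function name on_cpu and the sample rows are invented for illustration, and the filter assumes the eight-column order described above with no spaces inside command names.

```shell
#!/bin/sh
# Keep only rows whose last column (process state) is "O", i.e. processes
# currently running on a CPU, and print owner, command, and CPU time.
# on_cpu is a hypothetical helper; it expects the eight columns described
# above and command names that contain no spaces.
on_cpu() {
    awk 'NR > 1 && $NF == "O" { print $1, $2, $6 }'
}

# Made-up sample rows standing in for real ps output.
on_cpu <<'EOF'
RUSER COMMAND CLS VSZ ELAPSED TIME LWP S
alice myprog TS 20480 01:02:03 00:59:10 1 O
bob csh TS 1520 02:00:00 00:00:01 1 S
EOF
```

With the sample rows this prints alice myprog 00:59:10. On the machine itself, the ps command would be piped into on_cpu in place of the sample text.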
Since most of the machine will be used through the LSF queuing system,
the bjobs command is quite useful to see how loaded the machine
is. The most useful information comes from bjobs -u all
or bjobs -u all -r -p -s.
This prints jobs from all users, including running (-r), pending (-p),
and suspended (-s) jobs. You can also use
-a to print all jobs, but this will include recently completed jobs
as well (which are no longer loading the machine).
Use bjobs -u all -l for highly detailed information
(this is barely human-readable, but
easily parsed by Perl scripts into a more readable format).
A real quick snapshot can be obtained with lsload. This will
return the 15 second, 1 minute, and 15 minute load averages for the
machine. lsmon will continually update and display this information
in text; xlsmon will display it graphically.
Finally, psrinfo may be of some use if you want to know how many
processors are currently up or down (psrinfo -v also shows each CPU's
type and speed).
Introduction
The Sun Enterprise 10000 is a shared memory multiprocessor system using
workstation-class components and a high-bandwidth memory crossbar. The
current implementation of the E10K installed at NAVO has 64 CPUs and
64 GB of total memory.
Sun E10K Roadmap
From Sun's website, the roadmap for both the Sun E10K and SPARC architecture
looks promising.
.cshrc & .login Files
The following are useful additions to make to your .cshrc and
.login files:
set path = ( /usr/local/bin /bin /usr/bin /usr/sbin /usr/ccs/bin \
/opt/SUNWspro/bin /opt/SUNWhpc/bin /opt/SUNWlsf/bin \
/usr/openwin/bin /usr/ucb . )
setenv MANPATH /usr/share/man:/usr/local/man:/opt/SUNWspro/man:/opt/SUNWlsf/man:/opt/SUNWhpc/man
setenv LD_LIBRARY_PATH /opt/SUNWhpc/lib:/usr/openwin/lib
setenv OPENWINHOME /usr/openwin
.login:
if( $?LSB_QUEUE ) exit
Compilers
For parallel codes, Sun provides mpcc and mpf77
wrappers to their sequential compilers. Thus, you will not need to "manually"
include the location of the MPI header (mpi.h for C codes). This is otherwise
found in /opt/SUNWhpc/include. However, you do need to include
the MPI library with -lmpi (this is normally found in
/opt/SUNWhpc/lib, 64-bit versions are found in
/opt/SUNWhpc/lib/sparcv9).
You should also make sure that you are using the compilers in /opt/SUNWspro/bin
for best performance (type "which cc" to see which compiler is
earliest in your path).
Useful Libraries
The usual scientific libraries come with Sun's Performance Workshop. These
include LAPACK, BLAS 1/2/3, and FFTPACK. There is also an optimized
math library libmopt that is included automatically with the
-fast compiler flag for Fortran codes. C programmers should
specify both -fast and -xlibmopt.
LSF Queuing System
LSF (Load Sharing Facility) is the default queuing system that comes with
the Sun E10K. It works much the same as NQS or Codine, but uses different
keywords to signify options to the scheduler. The program to view queued
jobs is called
bjobs and produces a different output than qstat
does for NQS or Codine.
By default, bjobs -u all outputs
the following:
jbp@wolfe [ 335 ] % bjobs -u all
JOBID  USER    STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
924    mxxxxx  RUN   batch   wolfe      wolfe      Test1     May 30 21:00
962    wxxxxx  RUN   batch   wolfe      wolfe      lsf0001   May 31 08:40
965    wxxxxx  RUN   batch   wolfe      wolfe      lsf0002   May 31 09:25
926    mxxxxx  RUN   batch   wolfe      wolfe      Test1     May 30 22:00
968    mxxxxx  PEND  batch   wolfe                 Test1     May 30 23:00
962    mxxxxx  PEND  batch   wolfe                 Test1     May 31 01:00
957    txxxxx  PEND  ibatch  wolfe                 BigJob    May 31 10:02
The first column is the Job ID, the second column lists the user who owns
the job. The third column shows whether the job is running or queued
("pending"). The fourth column shows the queue that it was submitted to.
The fifth column indicates where the job was submitted from and the sixth
column shows where it is executing (or blank if the job is still pending).
The seventh column shows the job name. The last column shows when
the job was originally submitted. In the above example, only two
queues exist - batch and ibatch (or interactive-batch).
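Because the job state is always the third column, a one-line awk tally gives a quick picture of how busy the queues are. The sketch below is only illustrative: count_by_stat is a made-up name, and the input is the sample listing from this page rather than live bjobs output.

```shell
#!/bin/sh
# Count running and pending jobs from a bjobs listing read on stdin.
# count_by_stat is a hypothetical helper. STAT is column 3 even for
# pending jobs, whose EXEC_HOST column is blank.
count_by_stat() {
    awk 'NR > 1 { n[$3]++ }
         END    { printf "RUN=%d PEND=%d\n", n["RUN"], n["PEND"] }'
}

# The sample listing shown above, header line included.
count_by_stat <<'EOF'
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
924 mxxxxx RUN batch wolfe wolfe Test1 May 30 21:00
962 wxxxxx RUN batch wolfe wolfe lsf0001 May 31 08:40
965 wxxxxx RUN batch wolfe wolfe lsf0002 May 31 09:25
926 mxxxxx RUN batch wolfe wolfe Test1 May 30 22:00
968 mxxxxx PEND batch wolfe Test1 May 30 23:00
962 mxxxxx PEND batch wolfe Test1 May 31 01:00
957 txxxxx PEND ibatch wolfe BigJob May 31 10:02
EOF
```

For the sample listing this prints RUN=4 PEND=3. On the machine itself, the output of bjobs -u all would be piped into count_by_stat instead.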
Much like NQS, you must specify a queue when you submit a job, and each
queue has certain restrictions on run-time and number of CPUs.
#!/bin/csh
#BSUB -q batch
#BSUB -c hh:mm -M X
#BSUB -P NAxxxx -J jobname
#BSUB -B -N -o stdoutfile
#BSUB -n 10
#
setenv MPI_SHORTMSGSIZE 65536
setenv MPI_PROCBIND 1
pam myprog
The -q selects which queue to submit to, -c specifies the
CPU time requested (in hours and minutes), -M sets the requested
memory size in KB, -P selects the project to run under,
-J sets the job's name, -B and -N send
mail at the beginning and end of the run, and -o names the file
that receives any non-redirected output (by default, standard error
is sent there as well; use -e to send the errors
to a different file). Finally, -n sets the requested number of
processors. Note that this can be a range: -n 4 requests
exactly 4, -n 10,20 requests between 10 and 20 (as many as
possible, but as quickly as possible too). Note that setting the
maximum number of processors too high will cause the job to be rejected.
"Interactive" Batch Jobs
In order to more fairly distribute resources, you must run all parallel
interactive jobs through LSF. I made a simple script to handle some of
the mundane details you'll need to specify each time:
#!/bin/csh
bsub -I -n $1 -q ibatch -P NAxxxx pam $argv[2-]
where NAxxxx is your project or group id (just do groups
to see which ones you are a part of).
Save this script as "irun" and make it executable
(chmod +x irun). Now you can do irun 4 myprog arg1 arg2.
This will run myprog on 4 processors, sending it the command line
arguments arg1 and arg2.
System Health & Monitoring
The basic Unix system health monitors are available through Solaris.
This includes ps -A (probably more info than you ever want or need)
and top (or top -d 1i 64 for non-interactive use).
ps -A shows all current processes by all users, including various
daemons and monitors running under root. top shows the
top several processes running on the machine, according to CPU usage.
Thus, if you have a compute-intensive task, you should see it at the top
of the list. Note that the current version of top shows CPU
usage as a percent of the whole machine (64 processors). Thus, if your
process is using 100% of 1 CPU, it will show up as 1.56% (1/64th) of
the whole machine. A nice side note: top also shows the amount
of memory being used by each process, so you can get some idea of the
"high-water" mark for your code just by watching top.
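To turn top's whole-machine percentage back into a per-CPU figure, multiply by the number of processors. A small sketch, assuming the 64-CPU configuration described on this page (per_cpu_pct is a made-up helper name):

```shell
#!/bin/sh
# Convert top's "percent of the whole machine" into "percent of one CPU".
# per_cpu_pct is a hypothetical helper; the default of 64 CPUs matches the
# machine described on this page.
per_cpu_pct() {
    # $1 = percentage shown by top, $2 = number of CPUs (default 64)
    awk -v p="$1" -v n="${2:-64}" 'BEGIN { printf "%.1f\n", p * n }'
}

per_cpu_pct 1.56    # a single-threaded job that saturates one CPU
```

This prints 99.8, i.e. essentially one full CPU. A value well above 100 would indicate a multi-threaded process spread across several CPUs.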
% ps -Ao "ruser,comm,class,vsz,etime,time,lwp,s"
Hardware Performance Monitoring
Some Web Pointers