Notes on the Sun E10K

John B. Pormann
May 2000



Introduction

The Sun Enterprise 10000 (E10K) is a shared-memory multiprocessor system built from workstation-class components and a high-bandwidth memory crossbar. The E10K currently installed at NAVO has 64 CPUs and 64 GB of total memory.

The E10K at NAVO has 400 MHz UltraSPARC-II (SPARC V9) CPUs. Each CPU has a 16 KB Level-1 data cache and a 4 MB Level-2 cache. The UltraSPARC-II is a pipelined superscalar architecture which can issue 2 floating-point instructions per clock cycle. Thus the theoretical peak performance is 2 times the clock rate (i.e. up to 0.8 GFLOPS per CPU for the 400 MHz chips); sustained performance on real codes will be lower.
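The peak figure above is simple arithmetic, sketched here in Python. The constants come straight from the text; the whole-machine number is just the per-CPU peak multiplied by the 64 CPUs and is an upper bound, not a measured result.

```python
# Theoretical peak floating-point rate implied by the figures quoted above.
FLOPS_PER_CYCLE = 2   # floating-point instructions issued per clock (from the text)
CLOCK_MHZ = 400       # CPU clock rate of the NAVO machine (from the text)
NUM_CPUS = 64         # CPUs in the NAVO E10K (from the text)

def peak_gflops(clock_mhz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak GFLOPS for a single CPU."""
    return clock_mhz * flops_per_cycle / 1000.0

per_cpu = peak_gflops(CLOCK_MHZ)
whole_machine = per_cpu * NUM_CPUS
print(f"per CPU: {per_cpu:.1f} GFLOPS, whole machine: {whole_machine:.1f} GFLOPS")
```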

The Sun E10K uses a proprietary crossbar (called the Gigaplane-XB) to connect CPUs to memory. This interconnect has a bandwidth of 12.8 GB/sec.

REALLY REALLY Quick Introduction


Sun E10K Roadmap

From Sun's website, the roadmap for both the Sun E10K and SPARC architecture looks promising.
 


.cshrc & .login Files

The following are useful additions to make to your .cshrc and .login files:

.cshrc:

set path = ( /usr/local/bin /bin /usr/bin /usr/sbin /usr/ccs/bin \
    /opt/SUNWspro/bin /opt/SUNWhpc/bin /opt/SUNWlsf/bin \
    /usr/openwin/bin /usr/ucb . )
setenv MANPATH /usr/share/man:/usr/local/man:/opt/SUNWspro/man:\
    /opt/SUNWlsf/man:/opt/SUNWhpc/man
setenv LD_LIBRARY_PATH /opt/SUNWhpc/lib:/usr/openwin/lib
setenv OPENWINHOME /usr/openwin

.login:

if( $?LSB_QUEUE ) exit


Compilers

For parallel codes, Sun provides mpcc and mpf77 wrappers around their sequential compilers. The wrappers already know where the MPI header lives (/opt/SUNWhpc/include), so you do not need to add that location "manually". However, you do need to include the MPI library with -lmpi (this is normally found in /opt/SUNWhpc/lib; 64-bit versions are found in /opt/SUNWhpc/lib/sparcv9). You should also make sure that you are using the compilers in /opt/SUNWspro/bin for best performance (type "which cc" to see which compiler is earliest in your path).

Typical compiler optimization switches are:

mpcc -fast -xprefetch -o foo foo.c -lmpi

The -fast option is a shortcut for -xO4 -xtarget=native -fns -fsimple=1 -dalign -ftrap=%none -xlibmil -fsingle. Note that some of these options turn off certain IEEE floating-point exceptions (for example, -ftrap=%none disables trapping).

The -xO4 option tells the compiler to generate optimized code (level 5 is the highest setting). Note that -xO4 has the potential to alter the semantics of your code and thus may cause errors at run-time!

The -xprefetch option tells the compiler to generate code which will prefetch memory references on architectures that support it (UltraSPARC-II and higher).

The -fsingle option can be extremely useful if your code does a lot of single-precision floating point work. By default, the single-precision variables will be upgraded to double-precision, the work will be performed in double-precision, then downgraded to a single-precision result. By selecting -fsingle, all work will be done in single-precision, and the conversion steps are avoided.

For Fortran 90 programs, you need to specify -fixed for fixed-format Fortran lines or -free for free-form.

Note that there have been significant changes between the V.8, V.8plus, V.9, etc. revisions of the SPARC chips.  You should therefore make sure that you target the architecture that you will actually be running on (or use "native" if you will only be using one type of SPARC system).  Use of -xtarget=native will also ensure that the compiler uses the proper values to optimize for the chip's cache sizes.

To create 64-bit code, use "-xarch=v9" or "-xarch=v9a" on the compile line, but make sure you put this after the -xtarget option. Thus, you can compile 32-bit code by setting -xtarget=native; to compile 64-bit code you need -xtarget=native -xarch=v9. The mpcc compiler will automatically link in the proper MPI library.

For OpenMP codes, you'll need to turn on SMP support in the compiler with the -xautopar -xparallel -xexplicitpar options.  According to the man pages, you should avoid using -xparallel if you do your own thread management.

If you use both OpenMP and MPI in the same code, you will need to link with the multi-threaded version of MPI by using -lmpi_mt.


Useful Libraries

The usual scientific libraries come with Sun's Performance Workshop. These include LAPACK, BLAS 1/2/3, and FFTPACK. There is also an optimized math library libmopt that is included automatically with the -fast compiler flag for Fortran codes. C programmers should specify both -fast and -xlibmopt.

Sun also provides a vector math library for evaluating common mathematical functions on entire arrays of arguments. To use these libraries, link with libmvec or libmvec_mt for multi-threaded codes. More information on the specific routines can be found on the libmvec man page.

For 32- and 128-bit datatypes, you should link with the libsunmath library for best performance. For 64-bit datatypes (or to have 32-bit values automatically promoted to 64-bit), just use the standard libm or libmopt libraries.

Finally, Sun provides the Scalable Scientific Subroutine Library (S3L) for parallel codes. Link with -ls3l to include these routines. Details on how to use the S3L library are beyond the scope of this page; see the S3L User's Guide.


LSF Queuing System

LSF (Load Sharing Facility) is the default queuing system that comes with the Sun E10K. It works much the same as NQS or Codine, but uses different keywords to signify options to the scheduler. The program to view queued jobs is called bjobs and produces a different output than qstat does for NQS or Codine. By default, bjobs -u all outputs the following:
jbp@wolfe [ 335 ] % bjobs -u all
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
924   mxxxxx   RUN   batch      wolfe       wolfe       Test1      May 30 21:00
962   wxxxxx   RUN   batch      wolfe       wolfe       lsf0001    May 31 08:40
965   wxxxxx   RUN   batch      wolfe       wolfe       lsf0002    May 31 09:25
926   mxxxxx   RUN   batch      wolfe       wolfe       Test1      May 30 22:00
968   mxxxxx   PEND  batch      wolfe                   Test1      May 30 23:00
962   mxxxxx   PEND  batch      wolfe                   Test1      May 31 01:00
957   txxxxx   PEND  ibatch     wolfe                   BigJob     May 31 10:02
The first column is the Job ID, and the second column lists the user who owns the job.  The third column shows whether the job is running or queued ("pending"). The fourth column shows the queue that it was submitted to.  The fifth column indicates where the job was submitted from and the sixth column shows where it is executing (blank if the job is still pending).  The seventh column shows the job name.  The last column shows when the job was originally submitted.  In the above example, only two queues exist: batch and ibatch (or interactive-batch).  Much like NQS, you must specify a queue when you submit a job, and each queue has certain restrictions on run-time and number of CPUs.
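Because the columns are fixed, the listing is easy to post-process. The following is a minimal Python sketch (not part of LSF) that parses the sample output above and counts jobs by state. It assumes job names contain no spaces and that the submit time is always three tokens; the function name parse_bjobs is made up for this example.

```python
from collections import Counter

SAMPLE = """\
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
924   mxxxxx   RUN   batch      wolfe       wolfe       Test1      May 30 21:00
962   wxxxxx   RUN   batch      wolfe       wolfe       lsf0001    May 31 08:40
965   wxxxxx   RUN   batch      wolfe       wolfe       lsf0002    May 31 09:25
926   mxxxxx   RUN   batch      wolfe       wolfe       Test1      May 30 22:00
968   mxxxxx   PEND  batch      wolfe                   Test1      May 30 23:00
962   mxxxxx   PEND  batch      wolfe                   Test1      May 31 01:00
957   txxxxx   PEND  ibatch     wolfe                   BigJob     May 31 10:02
"""

def parse_bjobs(text):
    """Parse default `bjobs` output into a list of dicts (one per job)."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything too short to be a job.
        if len(fields) < 9 or fields[0] == "JOBID":
            continue
        jobid, user, stat, queue, from_host = fields[:5]
        # SUBMIT_TIME is always the last three tokens, e.g. "May 30 21:00".
        submit_time = " ".join(fields[-3:])
        middle = fields[5:-3]
        # Pending jobs have an empty EXEC_HOST column, leaving only JOB_NAME.
        if len(middle) == 2:
            exec_host, job_name = middle
        else:
            exec_host, job_name = "", middle[0]
        jobs.append({"jobid": jobid, "user": user, "stat": stat,
                     "queue": queue, "from_host": from_host,
                     "exec_host": exec_host, "job_name": job_name,
                     "submit_time": submit_time})
    return jobs

jobs = parse_bjobs(SAMPLE)
by_state = Counter(job["stat"] for job in jobs)
print(dict(by_state))
```

On a live system the same function could be fed the captured output of bjobs -u all instead of the SAMPLE string.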

To submit jobs, use the bsub program.  Much like NQS and Codine, you can submit jobs directly from the command line with bsub, but it is much easier to create a script to run your job, then submit the script.  As with NQS and Codine, LSF allows you to embed queuing options into your scripts.  Where NQS uses #QSUB and Codine uses #$, LSF uses #BSUB.  In most cases, there is a 1-to-1 replacement of NQS/Codine keywords with LSF keywords. Here's a good starting point:

#!/bin/csh
#BSUB -q batch
#BSUB -c hh:mm -M X
#BSUB -P NAxxxx -J jobname
#BSUB -B -N -o stdoutfile
#BSUB -n 10
#
setenv MPI_SHORTMSGSIZE 65536
setenv MPI_PROCBIND 1
pam myprog
The -q option selects which queue to submit to, -c specifies the CPU time requested (in hours and minutes), -M sets the requested memory size in KB, -P selects the project to run under, and -J sets the job's name. The -B and -N options send mail at the beginning and end of the run. The -o option names a file to receive any non-redirected output (by default the standard error output is sent there too; use -e to send errors to a different file). Finally, -n sets the requested number of processors. This can be a single value or a range: -n 4 requests exactly 4 processors, while -n 10,20 requests between 10 and 20 (you get as many as possible, but the job need not wait for all 20 before it starts). Note that setting the maximum number of processors too high will cause the job to be rejected.
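If you submit many similar jobs, the script header can be generated rather than edited by hand. The sketch below is one way to do that in Python; it only uses the #BSUB options described above. The helper name make_bsub_script and the example values for the CPU limit and memory size are made up for illustration.

```python
def make_bsub_script(command, queue, nprocs, project, job_name,
                     cpu_limit=None, mem_kb=None, stdout_file=None,
                     mail=True):
    """Return the text of a csh job script with embedded #BSUB options."""
    # -n accepts one value ("4") or a "min,max" range ("10,20").
    if isinstance(nprocs, tuple):
        nprocs = f"{nprocs[0]},{nprocs[1]}"
    lines = ["#!/bin/csh", f"#BSUB -q {queue}"]
    if cpu_limit is not None:
        lines.append(f"#BSUB -c {cpu_limit}")    # CPU time as hh:mm
    if mem_kb is not None:
        lines.append(f"#BSUB -M {mem_kb}")       # memory size in KB
    lines.append(f"#BSUB -P {project} -J {job_name}")
    if mail:
        lines.append("#BSUB -B -N")              # mail at start and end
    if stdout_file is not None:
        lines.append(f"#BSUB -o {stdout_file}")
    lines.append(f"#BSUB -n {nprocs}")
    lines += ["#", f"pam {command}"]
    return "\n".join(lines) + "\n"

# Example values (hypothetical limits): 1.5 hours of CPU, 512000 KB, 10 to 20 CPUs.
script = make_bsub_script("myprog", queue="batch", nprocs=(10, 20),
                          project="NAxxxx", job_name="jobname",
                          cpu_limit="1:30", mem_kb=512000,
                          stdout_file="stdoutfile")
print(script, end="")
```

The resulting text can be saved to a file and submitted like any hand-written script. Environment settings such as MPI_PROCBIND from the example above are left out here and would need to be added before the pam line.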

Use bqueues to find out about what queues are available (or bqueues -l for more detailed information).

"Interactive" Batch Jobs

In order to more fairly distribute resources, you must run all parallel interactive jobs through LSF. I made a simple script to handle some of the mundane details you'll need to specify each time:
#!/bin/csh
bsub -I -n $1 -q ibatch -P NAxxxx pam $argv[2-]
where NAxxxx is your project or group id (run groups to see which ones you are a part of). Save this script as "irun" and make it executable (chmod +x irun). Now you can do irun 4 myprog arg1 arg2. This will run myprog on 4 processors, passing it the command line arguments arg1 and arg2.


System Health & Monitoring

The basic Unix system health monitors are available through Solaris. These include ps -A (probably more info than you ever want or need) and top (or top -d 1i 64 for non-interactive use). ps -A shows all current processes by all users, including various daemons and monitors running under root. top shows the top several processes running on the machine, ranked by CPU usage. Thus, if you have a compute-intensive task, you should see it at the top of the list. Note that the current version of top shows CPU usage as a percent of the whole machine (64 processors). Thus, if your process is using 100% of 1 CPU, it will show up as 1.56% (1/64th) of the whole machine. A nice side note: top also shows the amount of memory being used by each process, so you can get some idea of the "high-water" mark for your code just by watching top.
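The conversion between top's whole-machine percentage and the more familiar per-CPU percentage is just a factor of 64. A small Python sketch of that arithmetic (the function names are made up for this example):

```python
NUM_CPUS = 64  # processors in the NAVO E10K (from the text)

def whole_machine_percent(per_cpu_percent, num_cpus=NUM_CPUS):
    """What top displays for a process using per_cpu_percent of one CPU."""
    return per_cpu_percent / num_cpus

def per_cpu_percent(top_percent, num_cpus=NUM_CPUS):
    """Convert the value top displays back to a percentage of one CPU."""
    return top_percent * num_cpus

shown = whole_machine_percent(100.0)   # one fully busy CPU
print(f"top shows {shown:.2f}% for a process using 100% of one CPU")
```

By the same arithmetic, an 8-process MPI job that keeps every process busy should show up as roughly 8 entries of about 1.56% each, or 12.5% of the machine in total.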

ps can provide more detailed information by using the -o (output format) option. For example:

  % ps -Ao "ruser,comm,class,vsz,etime,time,lwp,s"

According to the man page, this command will output (in order) the real user ID, the command name, the job class, total memory size (KB), elapsed time since the start of the program, cumulative CPU time, the LWP ID, and the state of the process. The process state is one of "O" (the process is running), "S" (sleeping), "R" (on the run queue), "Z" (zombie), or "T" (stopped). The output will probably scroll off the screen, but it is easily parsed by awk, grep, or perl scripts. (This info was lifted from the psmap shell script from Alan Wallcraft at NRL.)
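As an illustration of that kind of post-processing, here is a minimal Python sketch that totals virtual memory per user and counts processes that are on a CPU. The sample lines are made up for this example (they are not real output from the NAVO machine), and the parser assumes the eight-column format requested above, reading the last six columns from the right so that a command name containing spaces does not break it.

```python
from collections import defaultdict

# Made-up sample in the column order requested above (illustration only).
SAMPLE_PS = """\
RUSER    COMMAND  CLS    VSZ     ELAPSED        TIME  LWP S
root     sched    SYS      0  3-04:10:22        0:05    1 T
auser    myprog   TS  204800    01:12:40    01:10:03    1 O
auser    myprog   TS  204800    01:12:40    01:09:58    1 O
buser    csh      TS    1520       10:02        0:00    1 S
"""

def parse_ps(text):
    """Parse ps -o ruser,comm,class,vsz,<elapsed>,time,lwp,s output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and malformed lines.
        if len(fields) < 8 or fields[0] == "RUSER":
            continue
        cls, vsz, elapsed, cpu_time, lwp, state = fields[-6:]
        rows.append({"ruser": fields[0],
                     "comm": " ".join(fields[1:-6]),
                     "class": cls, "vsz_kb": int(vsz),
                     "elapsed": elapsed, "cpu_time": cpu_time,
                     "lwp": lwp, "state": state})
    return rows

rows = parse_ps(SAMPLE_PS)
vsz_by_user = defaultdict(int)
for row in rows:
    vsz_by_user[row["ruser"]] += row["vsz_kb"]
on_cpu = [row for row in rows if row["state"] == "O"]
print(dict(vsz_by_user), len(on_cpu))
```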

Since most of the machine will be used through the LSF queuing system, the bjobs command is quite useful to see how loaded the machine is. The most useful information comes from bjobs -u all or bjobs -u all -r -p -s. This prints jobs from all users, including running (-r), pending (-p), and suspended (-s) jobs. You can also use -a to print all jobs, but this will include recently completed jobs as well (which are no longer loading the machine). Use bjobs -u all -l for highly detailed information (this is barely human-readable, but easily parsed by Perl scripts into a more readable format).

A really quick snapshot can be obtained with lsload. This will return the 15 second, 1 minute, and 15 minute load averages for the machine. lsmon will continually update and display this information in text; xlsmon will display it graphically.

Finally, psrinfo may be of some use if you want to know how many processors are currently up or down (psrinfo -v also shows each CPU's type and speed).


Hardware Performance Monitoring


Some Web Pointers

http://www.sun.com/servers/highend/10000/
Sun's main web page for the Enterprise 10000
http://docs.sun.com/
Sun documentation