Cray XT3 (sapphire)
Advanced FAQs


Q. How is sapphire configured?
A. Sapphire is a Massively Parallel Processor (MPP) supercomputer that is a successor to the Cray XT3. Sapphire contains 4,160 nodes. The node pool is partitioned into compute and service partitions that are composed of 4,096 and 64 nodes, respectively. Compute nodes contain a single AMD 2.6‑GHz dual‑core Opteron processor and run a Linux microkernel called Compute Node Linux (CNL). The service nodes contain a single AMD 2.6‑GHz dual‑core Opteron running SUSE Linux and perform support functions for application and system services. All nodes are connected to each other in a three‑dimensional torus using a HyperTransport link to a dedicated Cray SeaStar communications engine. Sapphire is rated at 42.6 Peak TFLOPS and contains 374 TBytes of Fibre Channel RAID disk space.
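
The peak rating quoted above can be reproduced with a short calculation. The sketch below assumes that each Opteron core completes two floating-point operations per clock cycle; that figure is an assumption and is not stated in this FAQ.

```shell
# Back-of-envelope check of the quoted peak rating.
# ASSUMPTION: 2 floating-point operations per core per clock cycle.
compute_nodes=4096
cores_per_node=2
clock_ghz=2.6
flops_per_cycle=2   # assumed, not taken from this FAQ

# nodes x cores x GHz x flops/cycle gives GFLOPS; divide by 1000 for TFLOPS.
peak_tflops=$(awk -v n="$compute_nodes" -v c="$cores_per_node" \
                  -v g="$clock_ghz" -v f="$flops_per_cycle" \
                  'BEGIN { printf "%.1f", n * c * g * f / 1000 }')
echo "Peak: ${peak_tflops} TFLOPS"
```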

Q. What operating system is on sapphire?
A. The Cray XT3 operating system, UNICOS/lc, consists of two primary components: a microkernel for compute nodes and a full‑featured operating system for the service nodes. The XT3 CNL microkernel running on the compute nodes interacts with an application process in a very limited way by managing virtual memory addressing, providing memory protection, and performing basic scheduling. This proven microkernel architecture ensures reproducible run times for MPP jobs, supports fine‑grained synchronization at scale, and ensures high‑performance, high‑bandwidth MPI and SHMEM communication. Service nodes run a full SUSE Linux distribution with specific Cray XT3 modifications.

Q. How many nodes and cores are available for user jobs?
A. On sapphire, there are 8,192 compute cores on 4,096 nodes. The actual number of nodes available to a user is controlled by the Portable Batch System (PBS) batch queue structure. See the Cray XT3 Queue Limits Summary table for a complete queue summary.

Q. When I log in to sapphire, where am I running?
A. When you log in to sapphire, you will be running in an interactive shell on a login node. Subsequent sequential processes, such as system commands or sequential user programs, will run on the same node as your login shell.

Q. Where do my compiles execute?
A. All UNIX commands, including compile commands, execute on the login nodes.

Q. Where do my parallel programs execute?
A. Parallel programs execute on their dedicated subset of the 8,192 compute cores.

Q. Can I check the current usage of all the nodes?
A. Yes, an interactive display of node usage is available through the xtshowmesh command. See "man xtshowmesh" for details.

Q. How much memory can I use on each compute node?
A. Each compute node on sapphire contains 4 GBytes of memory. Approximately 200 MBytes are used by system processes. The two cores on a compute node share the remaining 3.8 GBytes of memory.
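
The per-core share follows directly from these figures. A minimal sketch, in which the helper name mem_per_core is hypothetical:

```shell
# Usable memory per active core on one compute node:
# 4 GBytes total minus roughly 0.2 GBytes used by system processes.
mem_per_core() {
    # $1 = number of active cores on the node (1 or 2)
    awk -v c="$1" 'BEGIN { printf "%.1f", (4 - 0.2) / c }'
}

for cores in 1 2; do
    echo "$cores active core(s): $(mem_per_core "$cores") GBytes each"
done
```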

Q. What application software is available?
A. For a description of supported software, see the Cray XT3 Software Version Matrix. Unsupported software can be found in the directory /usr/local/usp/ on sapphire.

Q. What compilers are available?
A. The Portland Group's (PGI) Programming Environment is the default programming environment on sapphire. PathScale and GNU compilers are also available. To switch from the PGI default programming environment, use one of the following commands:

module swap PrgEnv-pgi PrgEnv-pathscale     # To switch to PathScale
module swap PrgEnv-pgi PrgEnv-gnu           # To switch to GNU

Optimization flags are different for each programming environment and can be found in the man pages for each: for PGI, man pgf90, pgf77, pgcc, or pgCC; for PathScale, man pathf90, pathcc, or pathCC; for GNU, man gfortran, g77, gcc, or g++.

To compile your code to run on the compute nodes using any of the three programming environments, use the compilers listed in the following table. Note, you will still need to issue an aprun command in a batch job to run the compiled code on the compute nodes.

Compiler Description
ftn Fortran 90/95
cc C
CC C++

You may run small applications on sapphire's login nodes if they do not run for more than a few minutes. Alternatively, you may use PBS to schedule a batch job on a single batch interactive node. The OS on the batch interactive nodes is full SUSE Linux that can run serial or threaded applications. The batch interactive nodes contain a single dual‑core 2.6‑GHz Opteron processor with about 13 GBytes of usable memory. Any of the three programming environments may be used, but you must issue the same "module swap" commands listed above if you want to compile with PathScale or GNU. To schedule a single batch interactive node, use the PBS option "-l ncpus=0".

To compile a serial or threaded code to run on a login node or on a batch interactive node, use the compilers listed in the following table.

Compiler Description
pgf90 PGI Fortran 90/95
pgf77 PGI FORTRAN 77
pgcc PGI C
pgCC PGI C++
pathf90 PathScale Fortran 77/90/95
pathcc PathScale C
pathCC PathScale C++
gfortran GNU Fortran 90/95
g77 GNU FORTRAN 77
gcc GNU C
g++ GNU C++

Q. What do I need to know about modules?
A. The modules package is a convenient way for you to modify your programming environment without directly modifying the $PATH, $MANPATH, and other environment variables. The "module list" command can be used to determine what modules are currently loaded. After a major OS upgrade, the default modules are changed to the latest tested and proven OS-dependent libraries. You will be notified if you need to relink and/or recompile your codes to execute under the upgraded OS.

Use "module avail" to determine what modules (and what versions of those modules) are available. Newer compiler versions are often installed for testing purposes and may become the default version at some point in the future.

To use an alternate module, the "module swap" command can be used as shown in the following example that replaces the currently loaded PrgEnv with PrgEnv.3600:

sapphire$ module swap PrgEnv PrgEnv.3600

Note that libraries listed by the "module avail" command must be loaded before they can be used in compiling. Only libraries listed by the "module list" command will be automatically searched during compile and link operations. The command "module load library_name" will load any module library that is shown to be available. A list of all module keywords is given in response to "module help".

Q. What batch‑queuing system is used?
A. The Portable Batch System (PBS) is currently running on sapphire. The syntax of the command to run a parallel job is similar to other batch systems.

Q. What batch commands are available?
A. The following table lists some of the available batch commands. For more information, see the man page for each command.

Command Description
qsub Submits a batch job.
qstat Displays information about jobs and queues.
qview Displays information about jobs.
qhist Displays information about a user's jobs.
qlim Displays information about batch queues.
qdel Deletes a job.
qhold Places a hold status on a queued job.
qrls Releases a hold status on a job.

Q. How do I submit a batch job?
A. The preferred method is to embed the PBS directives within the batch request script using #PBS, as follows:

#PBS -l ncpus=4
#PBS -l walltime=4:00:00
#PBS -A project_name
#PBS -q standard

Then to submit the batch job script with the embedded PBS directives, use the following command:

qsub scriptname

This script must contain your eight‑character project name. Note: the show_usage command will generate a project list. For more information on qsub, see the qsub man page.
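
As a rough sketch of the two steps above, the following writes a minimal batch script and then submits it. The file name myjob.pbs and the echo placeholder are illustrative only, and project_name must be replaced with your own project name. The qsub call is guarded so the snippet does nothing harmful on a machine without PBS.

```shell
# Write a minimal batch script with embedded PBS directives.
# "myjob.pbs" is a hypothetical file name; replace project_name with yours.
script=myjob.pbs
cat > "$script" <<'EOF'
#PBS -l ncpus=4
#PBS -l walltime=4:00:00
#PBS -A project_name
#PBS -q standard
cd $PBS_O_WORKDIR
echo "job body goes here"
EOF

# Submit only where qsub exists (for example, on a login node).
if command -v qsub >/dev/null 2>&1; then
    qsub "$script"
else
    echo "qsub not found; wrote $script without submitting"
fi
```

The quoted 'EOF' delimiter keeps $PBS_O_WORKDIR from being expanded while the file is written, so it is evaluated when the job runs.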

Q. What might a batch script for sapphire look like?
A. A sample batch script appears below, requesting eight compute cores for 1 hour.

#PBS -A project_name
#PBS -l walltime=01:00:00
#PBS -l ncpus=8
#PBS -q standard
#PBS -N myjobname
#PBS -j oe
cd $PBS_O_WORKDIR

# Make a new subdirectory in working storage space.
mkdir $WORKDIR/projA-7

# Change to the new directory.
cd $WORKDIR/projA-7

# Check DMS availability. If not available, then wait.
archive stat -s

# Retrieve executable program from the DMS.
archive get -C $ARCHIVE_HOME/project_name program.exe

# Retrieve input data file from the DMS.
archive get -C $ARCHIVE_HOME/project_name/input data.in

# Execute a parallel program.
aprun -n 8 ./program.exe < data.in > projA-7.out

# Check DMS availability. If not available, then wait.
archive stat -s

# Create a new subdirectory on the DMS.
archive mkdir -C $ARCHIVE_HOME/project_name output7

# Transfer output file back to the DMS.
archive put -C $ARCHIVE_HOME/project_name/output7 projA-7.out

# Clean up unneeded files from working storage.
cd $WORKDIR
rm -r projA-7

When the script executes, it first changes directory to where the job was submitted (cd $PBS_O_WORKDIR). It then makes a run directory under $WORKDIR, changes to the newly created run directory, and gets both the executable file and input file from DMS. The script runs the executable on eight cores. Once the execution is done, the script archives the results back to DMS and cleans up the run directory.

Q. What is the aprun command?
A. The aprun utility loads and executes a program on one or more compute cores. The aprun utility reads the executable, obtains compute cores for it to run on, sends the application to the compute cores, and launches the application.

Q. After a batch job completes, where does the output go?
A. Standard error and standard out are written to the files specified by the "-e" and "-o" options, respectively. By default, these files are created in the directory from which the job was submitted. Specify the file names with full paths in order for them to be created in a different location. The default standard error and standard output file names are jobname.ejobID and jobname.ojobID, respectively.

For more information about standard error and standard out file naming, see the qsub man page.
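
The default naming rule can be illustrated with plain shell string handling. The job name and numeric job ID below are made-up values; real job IDs are assigned by PBS, and their exact format on sapphire is not shown in this FAQ.

```shell
# Build the default output file names: jobname.ojobID and jobname.ejobID.
# Both values below are hypothetical placeholders.
job_name=myjobname
job_id=12345

stdout_file="${job_name}.o${job_id}"
stderr_file="${job_name}.e${job_id}"
echo "stdout: $stdout_file"
echo "stderr: $stderr_file"
```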

Q. Is there a way to change directories automatically at job start?
A. No, PBS always begins execution in $HOME. Insert the command "cd $PBS_O_WORKDIR" immediately after the last PBS directive to change the current working directory to the location from which the job was submitted.

Q. How do I merge stderr and stdout in PBS?
A. The qsub option "-j" will cause the stderr and stdout streams to be merged into a single file. Specify "#PBS -j oe" to send the merged output to the stdout file; specify "#PBS -j eo" to send the merged output to the stderr file. This is detailed in the qsub man page.

Q. What queues are available?
A. Sapphire contains a Batch Queue structure essentially identical to the other ERDC DSRC systems. See the Cray XT3 Queue Limits Summary table for a complete queue summary.

Q. How do I determine the status of the queues?
A. Use the qlim, "qstat -Q", or "qstat -q" commands to see the available queues.

Q. How do I monitor batch jobs?
A. Use qview or qstat to list the status of all current batch jobs. For more information on a particular batch job, use "qstat -f job_id" where job_id is found in the output from the qview command.

Q. How do I cancel a batch job?
A. To cancel a job, use the "qdel job_id" command where job_id is found in the output from the qview command.

Q. How much space is in my $HOME directory?
A. Sapphire users are typically allocated 1 GByte of disk space in their home directory. This space can be accessed using the $HOME environment variable. Requests for more home directory space will be considered on a case-by-case basis. A much greater amount of disk space is available for temporary use in your $WORKDIR directory.

Q. What are $WORKDIR and $WORKDIR2?
A. $WORKDIR is a directory in the /work file system where you can temporarily store large amounts of data for job execution.

$WORKDIR2 is a second temporary storage directory in the /work2 file system. /work2 has been configured to provide increased I/O performance on large files and on files that are written using large record lengths. For example, $WORKDIR2 is a good location to create large tar files of data that resides in $WORKDIR.

Both $WORKDIR and $WORKDIR2 are automatically created for you upon login if they do not already exist.
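
The tar example mentioned above might look like the following sketch. The directory projA-7 and its file are sample data created only for the demonstration, and temporary directories stand in for $WORKDIR and $WORKDIR2 when those variables are not set (on sapphire they are set for you).

```shell
# Bundle a run directory from $WORKDIR into a tar file under $WORKDIR2.
# Fallback temporary directories are used only when the variables are unset.
WORKDIR=${WORKDIR:-$(mktemp -d)}
WORKDIR2=${WORKDIR2:-$(mktemp -d)}

# Sample data for the demonstration.
mkdir -p "$WORKDIR/projA-7"
echo "sample output" > "$WORKDIR/projA-7/projA-7.out"

# -C changes into $WORKDIR first so the archive stores relative paths.
tar -cf "$WORKDIR2/projA-7.tar" -C "$WORKDIR" projA-7

# List the archive to confirm what was stored.
tar -tf "$WORKDIR2/projA-7.tar"
```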

Q. Are $WORKDIR and $WORKDIR2 local to each node?
A. No, all the nodes on sapphire see the same file systems. No disks are local to any node. Therefore, the /work and /work2 file systems are shared.

Q. What is /tmp and how is it used?
A. /tmp is a very small memory‑resident nonpermanent file system used during program execution. You should not use /tmp. It should be reserved for system processes.

Q. Which file system should be used for running I/O‑intensive jobs?
A. The /work file system has the highest file transfer rates.

Q. How do I check my disk usage?
A. You can check your disk usage by using the command "show_storage". This returns the size of your home, archive, and work directories in megabytes.

Q. Can I use the Data Management System (DMS) from sapphire?
A. Yes, you can confirm availability of the DMS, and get and put files, using the archive command. See the man page for details.

Q. What is the XT3 programming environment?
A. The Cray XT3 programming environment includes tools designed to facilitate the development of scalable applications. The Opteron processor's native support for 32‑bit and 64‑bit applications and full x86‑64 compatibility makes the XT3 system compatible with many existing compilers and libraries, including optimized C, C++, and Fortran90/95 compilers and high‑performance numerical libraries such as optimized versions of BLAS, FFTs, LAPACK, ScaLAPACK, and SuperLU.

Communication libraries include MPI and SHMEM. The MPI implementation is compliant with the MPI 2.0 standard and is optimized to take advantage of the scalable interconnect in the XT3 system. The SHMEM library is compatible with previous Cray systems.

Q. What are the sizes of the standard C, C++, and Fortran data types?
A. Data type precision for sapphire is IEEE‑compliant. (See the table below for more information.) For your convenience, the following C/C++/Fortran data size summary is provided:

Data Size Summary
C/C++ Type           Fortran      XT3 Size in Bits
char                 -            8
-                    Integer*1    8
short                -            16
-                    Integer*2    16
int                  Integer*4    32
long                 Integer*8    64
long long            -            64
pointer              Integer*8    64
float                Real*4       32
double               Real*8       64
long double          Real*16      128
float complex        Complex*4    32 x 2
double complex       Complex*8    64 x 2
long double complex  Complex*16   128 x 2

Q. What parallel programming models are available?
A. Sapphire supports the Message Passing Interface, version 2 (MPI), SHared MEMory (SHMEM) and OpenMP shared memory on a node.

Q. What numerical libraries are available?
A. The XT3 provides a 64‑bit AMD Core Math Library (ACML) and LibSci. ACML provides Level 1, 2, and 3 BLAS; a full suite of Linear Algebra (LAPACK) routines; and a suite of Fast Fourier Transform (FFT) routines for single‑precision, double‑precision, single‑precision complex, and double‑precision complex data types. LibSci provides ScaLAPACK, BLACS, and SuperLU routines, which are not included in the ACML library.

For additional information, you may refer to the AMD Core Math Library and the Cray XT Series Programming Environment User's Guide, available on‑line from Cray.

Q. Are special actions required to access the double‑precision numerical libraries?
A. No, the compilers should automatically resolve any precision issues.

Q. What is MPI, and how do I use it?
A. Message Passing Interface (MPI) is the de facto standard library for portable message‑passing programming. MPICH2 is the implementation on sapphire. It is MPI 2.0‑compliant except for dynamic process creation. To use MPI routines within your program, you must add a line in the code to reference the MPI header file, as shown in the following examples:

In C, add the following line to the program and compile:

#include <mpi.h>
cc mpi_program.c

In Fortran, add the following line to the program and compile:

INCLUDE "mpif.h"
ftn mpi_program.f

Execute MPI programs as you would other parallel programs. For more information on MPI, see the MPI home page, available on‑line from Argonne National Laboratory. For more information on the Cray implementation of MPI, see the Cray XT Series Programming Environment User's Guide, available on‑line from Cray.

Q. What is OpenMP, and how do I use it?
A. OpenMP is a shared‑memory parallel‑programming interface. OpenMP uses directives inserted into your code to define areas of your code to be executed in parallel by use of threads. The directives are typically placed at the start and end of large do-loops that have enough iteration independence to be performed in parallel by separate threads. The OpenMP directives can also define which variables can be shared in memory or must be private in memory between the threads. Each compute node on sapphire has only one dual‑core Opteron processor, whose two cores share the node's memory, so it only makes sense to run with at most two threads per node. To compile an OpenMP code with the default PGI compiler, add the "-mp=nonuma" option; for a GNU compile, add "-fopenmp"; and for the PathScale compiler, add "-mp".

An example OpenMP code:

      program omptest
      implicit none
      real*8 x(20000)
      integer j,n,nt
      integer OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
! Each thread gets its own j, n, and nt; the array x is shared.
!$omp parallel private(j,n,nt) shared(x)
      n=OMP_GET_THREAD_NUM()
      nt=OMP_GET_NUM_THREADS()
      print *, "Thread number",n," of",nt
! Split the loop iterations among the threads.
!$omp do
      do j=1,20000
        x(j) = dble(j) * 3.14d0
      enddo
!$omp end do
!$omp end parallel
      end
    

The following example shows how to run an OpenMP parallel batch program on a compute node:

# Request one node
#PBS -l ncpus=2

# Set the number of threads to 2
export OMP_NUM_THREADS=2
aprun -n 1 -d 2 ./my_openmp.exe
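
One way to keep the thread count and the aprun depth option in step is to derive both from a single variable. This is only a sketch: it prints the launch line rather than running it, and ./my_openmp.exe is the placeholder executable name used above.

```shell
# Derive OMP_NUM_THREADS and the aprun -d value from one variable.
threads=2                       # sapphire compute nodes have two cores
export OMP_NUM_THREADS=$threads

# Build the launch line; in a batch job you would run it instead of printing it.
launch="aprun -n 1 -d $threads ./my_openmp.exe"
echo "$launch"
```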

Q. What is SHMEM, and how do I use it?
A. SHMEM is a library supporting one‑sided communication, available on systems from SGI and Cray. The SHMEM library is the most efficient means of PE communication available on sapphire. In order to use SHMEM routines within your programs, you must add a line in the code to reference the SHMEM header file and add an option to the compile command to reference the shared‑memory library, as shown in the following examples:

In C, add the following line to the program and compile:

#include <mpp/shmem.h>
cc -l sma shmem_program.c

In Fortran, add the following line to the program and compile:

INCLUDE "mpp/shmem.fh"
ftn -l sma shmem_program.f

Run SHMEM programs as you would other parallel programs. See "man intro_shmem" for details on the SHMEM library. For more information on the performance and use of SHMEM calls, see the Cray XT Series Programming Environment User's Guide, available on‑line from Cray.

Q. What Fortran standards are supported?
A. The Portland Group Fortran 90/95 Compiler used on sapphire provides the full ANSI Programming Languages capabilities of FORTRAN 77, Fortran 90, and Fortran 95 with a comprehensive set of Fortran extensions.

Q. How do I access the GNU version of make?
A. GNU make is the default make command on sapphire. It is automatically included in your $PATH.

Q. What programming and performance analysis tools are available?
A. The TotalView and GNU (gdb) debuggers and the CrayPat, PAPI, and Apprentice2™ performance analysis tools are available on sapphire. See "man totalview" for details on TotalView; additional information can be found through the Documentation tab on the TotalView Technologies website. For additional details about CrayPat, see the pat_build and pat_report man pages. For gdb, see the gdb man pages.

Q. How do I debug parallel programs?
A. Sapphire provides the TotalView debugger. See the man page for more information.

Q. Where do I run TotalView?
A. If TotalView is run from the login nodes, you can debug serial codes compiled using the pgf90, pgcc, etc. compilers (with the -g option). TotalView is an X11 application and requires an Xterm session. To run a TotalView (serial) job on a login node, use the following command:

totalview ./a.out

If you need to debug a parallel code, compile it for the compute nodes (using the -g option) and then follow the steps outlined here.

Q. How do I analyze the performance of parallel programs?
A. The programming and performance analysis tools described above all support parallel programs. In addition, they are all "post mortem"; execution of a program produces an analysis file, and this file is used by the tool well after the program finishes. Therefore, the performance‑analysis tools are effective within a batch environment. Simply run each parallel program as a batch job, remembering to copy the necessary analysis files to permanent storage within the batch script. After a particular job completes, use the desired tool to interpret the analysis files.

Q. How do I run on a single core per node?
A. If more than 2 GBytes of memory per MPI process is required, then running on a single core per node can provide up to 3.8 GBytes of memory per MPI process. Because nodes cannot be shared with other users, this configuration keeps one core per node active and leaves the other core idle. The active core has access to all the memory on the node. For example, to run 32 MPI processes on 32 nodes, the required PBS directive and aprun options are:

#PBS -l ncpus=64
aprun -n 32 -N 1 ./a.out

The PBS directive will allocate 64 cores on 32 nodes, but the aprun command option "-N" forces only one core per node to be active.
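
The same arithmetic can be written as a small helper for other process counts. This is a sketch under stated assumptions: the function name ncpus_for is hypothetical, and the process count is assumed to be an exact multiple of the processes per node.

```shell
# Compute the ncpus value to request when only some cores per node are active.
# ASSUMPTION: the process count is an exact multiple of the processes per node.
ncpus_for() {
    # $1 = total MPI processes (aprun -n)
    # $2 = active processes per node (aprun -N)
    # The factor 2 is the number of cores on each sapphire compute node.
    echo $(( $1 / $2 * 2 ))
}

nprocs=32
per_node=1
ncpus=$(ncpus_for "$nprocs" "$per_node")
echo "#PBS -l ncpus=$ncpus"
echo "aprun -n $nprocs -N $per_node ./a.out"
```

With 32 processes and one process per node, this prints the same directive and aprun line shown above.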

Q. What login shells are available?
A. The following shells are available on all of our systems: bash, csh, ksh, tcsh, and sh. If you don't request a specific shell on your account application, you are assigned the tcsh shell by default.

Q. How can I change my login shell?
A. You can contact our Service Center via e‑mail, phone, or walk‑in to have your default shell changed.

Q. What commands are available to provide information on the entire system?
A. Regular Unix commands only work on the specific login node into which you are logged. The following commands allow operations to and provide information on the entire system. Further information and command syntax can be found using the system's man page utility.

Command Description
xtshowmesh Shows information about compute and service partition nodes and the jobs running in each partition.
xtshowcabs Shows information about compute and service nodes organized by chassis and cabinet.
xthostname Displays or sets the xthostname value.

Last update: July 10, 2009
