Increasing Performance on Lustre File Systems
Each HPC system has a large, high-performance, Lustre-based file system dedicated to the temporary storage of data produced while user batch jobs execute. These parallel file systems achieve their performance by automatically dividing data into chunks and writing them simultaneously across multiple disk sets, called object storage targets or "OSTs." This process, called "striping," plays a vital role in running very large jobs because it significantly improves file I/O speed, thereby reducing the time required to read or write a file. Without parallel striping, large jobs, many of which require hundreds of GBytes of disk space, would spend much of their time just reading from and writing to disk.
In the following discussion, the terms "stripe count" and "stripe size" will be used frequently. Stripe count refers to the number of stripes into which your file is divided. Each stripe of your file will reside on a different OST. Thus, if a file has a stripe count of 8, that file resides in approximately equal portions on 8 different OSTs. Stripe size refers to the amount of file system space that is allocated for your file each time it needs additional space.
How do stripe count and stripe size work together? Let's say you are writing 200 MBytes to a file that was created with a stripe count of 10 and a stripe size of 1 MByte. When the file is initially created, it will exist as ten 1-MByte blocks, one on each of 10 different OSTs. Due to the parallel nature of Lustre, 1 MByte of data is written to each of the 10 allocated blocks simultaneously. Once those 10 blocks have been filled, Lustre allocates an additional 1 MByte of space on each of the 10 OSTs. The next 10 MBytes of data are written simultaneously to the newly allocated blocks. This process continues until the entire file has been written. Upon completion, the file will exist as twenty 1-MByte blocks of data on each of 10 separate OSTs.
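The arithmetic in this example can be checked with a short shell sketch. The function name `blocks_per_ost` is hypothetical, chosen only for illustration, and the sketch assumes blocks are dealt out to the OSTs one at a time in turn, as described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): report how many stripe-size
# blocks end up on the most heavily used OST when a file is written
# round-robin across its stripes.
# Arguments: file size in MBytes, stripe count, stripe size in MBytes.
blocks_per_ost() {
    file_mb=$1
    stripe_count=$2
    stripe_mb=$3
    # Total number of blocks, rounding up for a partial final block.
    total_blocks=$(( (file_mb + stripe_mb - 1) / stripe_mb ))
    # One block goes to each OST in turn, so the fullest OST holds
    # ceil(total_blocks / stripe_count) blocks.
    echo $(( (total_blocks + stripe_count - 1) / stripe_count ))
}

# The example from the text: 200 MBytes, stripe count 10, stripe size 1 MByte.
blocks_per_ost 200 10 1
```

For the 200-MByte example this prints 20, matching the twenty 1-MByte blocks per OST described above.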
The following table lists technical specifications for the Lustre file systems on each HPC.
| System | File System Name | Maximum Capacity | Number of OSTs | OST Capacity | Default Stripe Count | Default Stripe Size |
|---|---|---|---|---|---|---|
| Diamond | /work | 721 TBytes | 253 | 2.85 TBytes | 6 | 1 MByte |
| Garnet | /work | 737 TBytes | 240 | 3.1 TBytes | 4 | 1 MByte |
On each HPC, the environment variable $WORKDIR refers to each user's working directory in the /work file system.
As mentioned above, one of the primary benefits of striping large files is increased I/O performance when reading and writing. A secondary benefit is that spreading large files over multiple OSTs helps prevent the system from degrading to a state from which it cannot recover. If one or more OSTs become too full or actually run out of space, then all programs attempting to create or write to files on those OSTs will stall. In addition, this can cause the entire system to hang and require file deletions and possibly a reboot to recover.
The default stripe counts and stripe sizes have been chosen to balance the needs of performance and available space. Setting stripe sizes lower than 1 MByte is discouraged. Remember, the stripe size multiplied by the stripe count is the minimum amount of space that will be allocated for any file. For example, a file of only 10 KBytes of actual data will still be allocated 4 MBytes of space if its stripe count is 4 and the stripe size is 1 MByte. In addition, setting the stripe count too high can actually cause degradation in I/O performance. Therefore, you are urged to be cautious in choosing new stripe specifications for your data.
In general, the default stripe specifications should be sufficient for average data needs. A good rule of thumb for deciding whether to change the defaults is to look at the size of the largest files you will write. If any file is larger than 10 GBytes multiplied by the default stripe count, then you should consider increasing its stripe count. For example, a file larger than 60 GBytes on Diamond's /work file system (default stripe count of 6) would be a good candidate. A second rule of thumb, for choosing how many stripes to use, is to divide the file size in GBytes by 16. So, if the file size will be 320 GBytes, then a good stripe count for that file would be 20 (i.e., 320 / 16).
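The two rules of thumb can be expressed as a small shell sketch. The function names are hypothetical, sizes are whole GBytes, and rounding the result up to the next whole stripe is an assumption made here, since the text only gives an example that divides evenly.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the two rules of thumb; sizes are
# whole GBytes.

# needs_more_stripes FILE_GB DEFAULT_STRIPE_COUNT
# Succeeds when the file is larger than 10 GBytes * default stripe count.
needs_more_stripes() {
    [ "$1" -gt $(( 10 * $2 )) ]
}

# suggested_stripe_count FILE_GB
# File size divided by 16, rounded up (assumption), never less than 1.
suggested_stripe_count() {
    count=$(( ($1 + 15) / 16 ))
    if [ "$count" -lt 1 ]; then
        count=1
    fi
    echo "$count"
}

# Example: a 320-GByte file on a system whose default stripe count is 6.
if needs_more_stripes 320 6; then
    echo "suggested stripe count: $(suggested_stripe_count 320)"
fi
```

With the values from the text, the check succeeds (320 is greater than 60) and the suggested stripe count is 20.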
Stripe parameters can be set for both individual files and for directories. However, changing the stripe parameters on an existing file has no effect. You must first create an empty file with the desired striping characteristics and then write your data to it. Likewise, changing the stripe parameters on a directory does not change the striping on files already existing in that directory. Only new files created in the modified directory will inherit the changed characteristics.
The following commands show how to create an empty file named LargeFile with a stripe count of 8 and a stripe size of 1 MByte (1048576 bytes).
$ cd $WORKDIR
$ lfs setstripe LargeFile -s 1048576 -i -1 -c 8
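Because changing the stripe parameters of an existing file has no effect, giving an existing file a new stripe count means creating an empty file with the desired striping and copying the data into it. The following is a minimal sketch of that procedure, not a site-provided tool: the function name `restripe` and the temporary file name are hypothetical, and only the stripe count is set, so the other parameters keep their defaults.

```shell
#!/bin/sh
# restripe FILE STRIPE_COUNT
# Hypothetical helper: rewrite FILE so that it uses a new stripe count.
restripe() {
    file=$1
    count=$2
    tmp="$file.restripe.$$"   # temporary name (assumption); must not exist yet

    # 1. Create an empty file with the desired stripe count.
    lfs setstripe "$tmp" -c "$count" || return 1

    # 2. Copy the data into it. cp writes into the file created above,
    #    so the data is laid out with the new striping.
    cp "$file" "$tmp" || { rm -f "$tmp"; return 1; }

    # 3. Replace the original. A file moved with mv keeps its own
    #    striping, so the new layout is preserved.
    mv "$tmp" "$file"
}
```

For example, `restripe $WORKDIR/LargeFile 16` would rewrite LargeFile with a stripe count of 16. The copy temporarily needs as much free space as the file itself, and `lfs getstripe` can be used afterward to confirm the resulting layout.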
The following commands show how to set the stripe size to 1 MByte, the stripe index to -1, and the stripe count to 16 for a new directory named LargeDir. The "-i -1" index option lets Lustre choose which OSTs to use for the files that will be created inside LargeDir. Note also that any subdirectories created under LargeDir will inherit its new stripe characteristics.
$ cd $WORKDIR
$ mkdir LargeDir
$ lfs setstripe LargeDir -s 1048576 -i -1 -c 16
Some additional information to keep in mind:
- Files moved into a striped directory with the mv command do not inherit the directory's stripe characteristics. Files created by using commands such as cp, cat, scp, and tar, or created during program execution will inherit the characteristics.
- It is strongly recommended that you do not create files exceeding 200 GBytes for mass storage archival. Creating files larger than 200 GBytes and sending those files to mass storage increases the chance of data corruption, results in very slow file retrieval, and may prevent the mass storage system from making a backup of your file on its backup system. Mass storage is a tape file system and is best suited for archival of large files (under 200 GBytes). Tape file systems are not suited for archiving large numbers of small files; it is better to archive one large tarball containing many small files.
- The lfs setstripe command can be used within batch jobs.
- Additional information can be found by viewing the lfs man page on any HPC system.