Table of Contents
- An Overview of Parallel File Systems
- NCSA Parallel File Systems
- Tips on Using NCSA's Parallel File Systems
An Overview of Parallel File Systems
High-performance parallel file systems are the next-generation complement
to existing High Performance Computing (HPC) systems. The primary goal of
parallel file systems research at NCSA is to handle large data sets
efficiently and to make several terabytes of fast disk visible to large
multi-teraflop computing clusters.
NCSA is continually deploying and testing parallel file systems on
different architectures and with different disk configurations. NCSA's
Mercury TeraGrid cluster has one of the world's largest SAN fabrics ever
deployed. Parallel file systems are at the bleeding edge of
HPC research. To date, most parallel file system
configurations have been considered unstable and used with caution
(primarily for short-term scratch use). But as more experience is gained
through deployment in demanding production environments, stable
implementations of parallel file systems will eventually be used for most
aspects of file management on HPC systems.
Parallel file systems offer numerous advantages and address some key issues
by providing:
- Concurrent access to files by multiple nodes of a cluster. This spares
users from having to write to the local disk of each node and then reassemble
the output into either a single coherent file or a collection of multiple
files (sometimes referred to as post-mortem reassembly).
- Scalable performance. Parallel file systems are designed
with scalability in mind. As clusters grow, more disk and more network
connections need to be incorporated into the fabric of a file system.
- A single disk space where serial files and files created
by parallel applications can coexist and be manipulated.
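To make the first point concrete, the following is a minimal sketch of what post-mortem reassembly involves when each process has written its own file. The file naming scheme (`output.NNNN`) and both function names are hypothetical, chosen only for illustration; real applications may also need to reorder or interleave records rather than simply concatenate them.

```python
import tempfile
from pathlib import Path


def write_rank_file(directory: Path, rank: int, payload: bytes) -> Path:
    """Write one process's output to its own file (hypothetical naming scheme)."""
    path = directory / f"output.{rank:04d}"
    path.write_bytes(payload)
    return path


def reassemble(directory: Path, nranks: int, merged_name: str = "output.all") -> Path:
    """Concatenate the per-rank files, in rank order, into a single file."""
    merged = directory / merged_name
    with merged.open("wb") as out:
        for rank in range(nranks):
            out.write((directory / f"output.{rank:04d}").read_bytes())
    return merged


if __name__ == "__main__":
    # Demonstration in a temporary directory; a loop stands in for real processes.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        for rank in range(4):
            write_rank_file(workdir, rank, f"rank {rank}\n".encode())
        print(reassemble(workdir, 4).read_text(), end="")
```

On a parallel file system this extra step can often be avoided, because all processes can write into the same directory, or the same file, directly.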
It is clear that parallel file systems satisfy many of the requirements of
modern high-performance computing. Parallel file systems provide a simple,
traditional UNIX file system interface to the complicated underlying file
storage methods. Large-scale deployments in the
tens and hundreds of terabytes are pushing software designs and network hardware
to their limits.
Deploying stable high-performing parallel file systems of this scale remains
a challenge, but it will help make massive data sets more useful and easier
to manipulate.
Feedback from users about problems or suggestions relating to I/O performance
will help us shape the
future requirements of HPC-I/O and facilitate data-intensive work.
NCSA Parallel File Systems
| System | File System | Raw Capacity (TB) | Formatted Capacity (TB) | Mount Point(s) | Characteristics and Performance |
|---|---|---|---|---|---|
| Mercury (login-hg.ncsa.teragrid.org) [User Docs] | NFS | 18 | 1 | /scratch | Cluster-wide scratch area. Use for testing and small jobs. Not a high-throughput parallel file system. |
| | | | 5.5 | /home /usr/local | Cluster-wide access to libraries and programs. |
| | | | — | — | Testing. |
| | GPFS (NSD) | 50 | 39 | /gpfs_scratch1/ | Accessible from login, GridFTP and all compute nodes. Recommended for I/O of less than 1 TB. Good all-purpose parallel file system. Much better performance than NFS. |
| | LUSTRE (Test) | 20 | — | — | Testbed for future LUSTRE implementations. |
| Tungsten (login-w.ncsa.teragrid.org) [File System Details] [User Docs] | NFS | 15 | 5.4 | /nfs/scratch | Cluster-wide scratch area. Use for testing and small jobs. Not a high-throughput parallel file system. |
| | | | 5.5 | /u | Cluster-wide access to home directories. |
| | | | 1 | /usr/local /usr/apps | Cluster-wide access to libraries and programs. |
| | LUSTRE | 140 | 59 | /cfs/scratch/users/ | Available for batch use and accessible from login nodes. Recommended file system for batch jobs. Striping (parallel access) to one file is disabled by default. |
| | | | 49 | /cfs/projects/ widescratch/users/ | Available for batch use and accessible from login nodes. Striping is enabled for parallel access to a single file. |
| Copper (login-cu.ncsa.teragrid.org) [User Docs] | GPFS (AIX) | 50 | 35 | /scratch/users/ /scratch/batch/ /u | Cluster-wide accessibility to GPFS under IBM's native AIX architecture. |
| Cobalt (login-co.ncsa.teragrid.org) [User Docs] | CXFS | 370 | 100 | All mountpoints | Cluster file system. Allows proprietary SGI linkage (high performance) as well as traditional NFS (high availability). Scalable bandwidth when accessed from multiple compute nodes. |
General Parallel File System (GPFS)
General Parallel File System (GPFS) is a parallel file system package
developed by IBM. It was originally developed for IBM's AIX operating
system and later ported to Linux.
Features:
- Appears to work just like a traditional UNIX file system from the
user application level.
- Provides additional functionality and enhanced performance when
accessed via parallel interfaces such as MPI-I/O.
- GPFS obtains high performance by striping data across multiple
nodes and disks.
- Striping is performed automatically at the block level. Therefore,
all files (larger than the designated block size) will be striped.
- Can be deployed in NSD or SAN configurations.
- Clusters hosting a GPFS file system can allow other clusters
at different geographical locations to mount that file system.
- Metadata is stored on all nodes.
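The block-level striping described in the features above can be illustrated with a small model. This is a sketch that assumes simple round-robin placement of fixed-size blocks; it is not GPFS's actual allocation algorithm, and the function names, block size, and disk count are all illustrative assumptions.

```python
from typing import Tuple


def locate_block(offset: int, block_size: int, num_disks: int) -> Tuple[int, int]:
    """Map a byte offset to (disk index, block index on that disk).

    Assumes plain round-robin placement: block 0 on disk 0, block 1 on
    disk 1, and so on, wrapping around after the last disk.
    """
    if offset < 0 or block_size <= 0 or num_disks <= 0:
        raise ValueError("offset must be >= 0; block_size and num_disks must be > 0")
    block = offset // block_size
    return block % num_disks, block // num_disks


def disks_touched(file_size: int, block_size: int, num_disks: int) -> int:
    """Number of distinct disks that hold part of a file of the given size."""
    if file_size <= 0:
        return 0
    num_blocks = -(-file_size // block_size)  # ceiling division
    return min(num_blocks, num_disks)


if __name__ == "__main__":
    KIB = 1024
    block_size = 256 * KIB  # assumed block size, for illustration only
    num_disks = 8           # assumed number of disks
    for size in (100 * KIB, 1024 * KIB, 10 * 1024 * KIB):
        print(f"{size // KIB} KiB file spans {disks_touched(size, block_size, num_disks)} disk(s)")
```

Under these assumed values, a 100 KiB file fits in one block and lives on a single disk, a 1 MiB file spans four disks, and a 10 MiB file spans all eight. This is why only files larger than the block size benefit from striping.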
Local Configuration and Availability:
- GPFS is available on the Mercury TeraGrid cluster and on the
IBM p-series (Copper).
- GPFS is not global to all NCSA machines. The GPFS file systems on Copper
are local to each cluster.
Simple GPFS NSD Configuration
In an NSD configuration, a subset of nodes is designated
as I/O nodes or "NSD servers". Depending on performance and/or
redundancy requirements, the ratio of I/O nodes to compute nodes can vary. The
fact that I/O takes place remotely is transparent to applications running
on the compute nodes.
Simple GPFS SAN Configuration
In a SAN configuration, all nodes have a direct connection to the
disk arrays. This configuration permits very high I/O throughput to disk
and scales very well as the number of processes increases.

Linux Cluster File System (LUSTRE)
LUSTRE, also known as the Linux Cluster
File System, is an open-source distributed file system.
Features
- Appears to work like a traditional UNIX file system (similar to GPFS).
- Distributed Object Storage Targets (OSTs) are responsible for actual
file-to-disk transactions.
- A user-level library is available to allow application I/O requests to
be translated into LUSTRE calls.
- As with other parallel file systems, data striping from concurrently
running nodes is the main performance-enhancing factor.
- Metadata is provided from a separate server.
Local Configuration and Availability
- LUSTRE is deployed on Tungsten, NCSA's Linux ia32 cluster.
- There are currently 104 OSTs serving the various "sub-clusters" that
comprise Tungsten (tuna, tunb, tunc, tund, tune).
- Striping of large files across multiple OSTs can be enabled on a
per-file or per-directory basis.
- Good parallel performance can be obtained when multiple processes
write individual files, or in striped mode, when accessing a single
file concurrently.
- More implementation details and the current status of LUSTRE on Tungsten
can be found on the
Tungsten File Systems Overview page.
SGI's Cluster CXFS File System
CXFS is
SGI's latest shared file system based on XFS.
Features
- As with GPFS and LUSTRE, CXFS appears to work exactly like a traditional
UNIX file system.
- Additional libraries are provided to invoke non-buffered direct I/O for
very large, memory-intensive applications (see man intro-ffio for
details).
- Metadata is handled by a centralized server.
Local Configuration and Availability
- CXFS is deployed on Cobalt, NCSA's Linux ia64 SGI Altix system.
- All mounted file systems on Cobalt are on CXFS, including /home directories.
- CXFS supports concurrent access to files from multiple compute nodes.
- Both MPI and OpenMP applications can improve performance by parallelizing
I/O on CXFS.
Tips on Using NCSA's Parallel File Systems
In general, performance will improve if I/O operations are performed in a
directory that is mounted on a parallel file system. Changing to the appropriate
scratch file system while running a job is the first step in improving
I/O. For parallel applications,
the goal is to have many nodes performing I/O operations concurrently,
which improves performance up to the limit imposed by the network and
hardware configuration of the file system.
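As a minimal sketch of the first step, a job can switch its working directory to a scratch area before doing any heavy I/O. The environment variable `JOB_SCRATCH` and the helper `working_in` below are hypothetical names used only for illustration; substitute the appropriate mount point for the system in use.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_in(directory):
    """Temporarily make `directory` the current working directory."""
    previous = os.getcwd()
    os.makedirs(directory, exist_ok=True)
    os.chdir(directory)
    try:
        yield directory
    finally:
        # Always return to the original directory, even if the I/O fails.
        os.chdir(previous)


if __name__ == "__main__":
    # JOB_SCRATCH is a hypothetical variable naming a directory on a parallel
    # file system; a temporary directory is used here if it is not set.
    scratch = os.environ.get("JOB_SCRATCH", tempfile.mkdtemp())
    with working_in(scratch):
        # Relative paths now resolve inside the scratch directory.
        with open("checkpoint.dat", "wb") as f:
            f.write(b"\0" * 1024)
```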
If it is determined that a particular application is spending a significant amount of
time performing I/O, dividing the load among more nodes on the cluster will
increase bandwidth to the file system. Consider writing one file per process rather
than allowing the I/O to be serialized. If done carefully, files can still be concatenated
for portability on other systems. Routines can be written to support reading a set of
files from a different number of processes than they were created with. MPI codes can
take advantage of MPI-I/O routines that allow concurrent reading and writing to one file.
These routines improve performance greatly over serial I/O on most of NCSA's file systems.
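The idea behind concurrent writing to one file is that each writer owns a disjoint byte range, so no writer has to wait for another. The sketch below shows that pattern using only POSIX positional writes (`os.pwrite`) and threads as a stand-in for MPI ranks; it is not MPI-I/O itself, where a routine such as `MPI_File_write_at` plays the corresponding role. The record size and function names are illustrative assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

RECORD_SIZE = 16  # bytes per writer; a fixed size lets each offset be computed


def write_slice(path: str, rank: int) -> int:
    """Write this rank's fixed-size record at its own offset in the shared file."""
    record = f"rank {rank:03d}".ljust(RECORD_SIZE - 1).encode() + b"\n"
    fd = os.open(path, os.O_WRONLY)
    try:
        # pwrite takes an explicit offset, so writers never share a file pointer.
        return os.pwrite(fd, record, rank * RECORD_SIZE)
    finally:
        os.close(fd)


def write_shared(path: str, nranks: int) -> None:
    """Have `nranks` writers fill disjoint regions of one file concurrently."""
    # Create the file and pre-size it so every region exists up front.
    with open(path, "wb") as f:
        f.truncate(nranks * RECORD_SIZE)
    with ThreadPoolExecutor(max_workers=nranks) as pool:
        list(pool.map(lambda rank: write_slice(path, rank), range(nranks)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shared = os.path.join(tmp, "shared.dat")
        write_shared(shared, 4)
        with open(shared, "rb") as f:
            print(f.read().decode(), end="")
```

Because every record lands at `rank * RECORD_SIZE`, the resulting file is already in rank order and needs no reassembly, regardless of the order in which the writers finish.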
Parallel I/O performance is an active field of research. Ultimately, file system
details and I/O optimization parameters will be transparently incorporated
into most common scientific I/O libraries. Much work still needs to be done
before this becomes a reality. Details such as buffer sizes and striping methods are
inherently dependent on the underlying hardware configurations and are thus
platform-dependent.