
NCSA Parallel File Systems

Table of Contents

  1. An Overview of Parallel File Systems
  2. NCSA Parallel File Systems
  3. Tips on Using NCSA's Parallel File Systems

An Overview of Parallel File Systems

High-performance parallel file systems are the next-generation complement to existing High Performance Computing (HPC) systems. The primary goal of parallel file systems research at NCSA is to handle large data sets efficiently and to make several terabytes of fast disk visible to large multi-teraflop computing clusters.

NCSA is continually deploying and testing parallel file systems on different architectures and with different disk configurations. NCSA's Mercury TeraGrid cluster has one of the largest SAN fabrics ever deployed. Parallel file systems are at the bleeding edge of HPC research. To date, most parallel file system configurations have been considered unstable and used with caution (primarily for short-term scratch use). However, as more experience is gained through deployment in demanding production environments, stable implementations of parallel file systems will eventually be used for most aspects of file management on HPC systems.

Parallel file systems offer numerous advantages and address some key issues by providing:

  • Concurrent access to files by multiple nodes of a cluster. This prevents users from having to utilize the local disk of each node and then reassemble the output into either a coherent single file or a collection of multiple files (sometimes referred to as post-mortem reassembly).
  • Scalable performance. Parallel file systems are designed with scalability in mind. As clusters grow, more disk and more network connections need to be incorporated into the fabric of a file system.
  • A single disk space where serial files and files created by parallel applications can coexist and be manipulated.

It is clear that parallel file systems satisfy many of the requirements of modern high-performance computing. Parallel file systems provide a simple, traditional UNIX file system interface to the complicated underlying file storage methods. Large-scale deployments in the tens and hundreds of terabytes are pushing software designs and network hardware to their limit. Deploying stable high-performing parallel file systems of this scale remains a challenge, but it will help make massive data sets more useful and easier to manipulate.

Feedback from users about problems or suggestions relating to I/O performance will help us shape the future requirements of HPC-I/O and facilitate data-intensive work.


NCSA Parallel File Systems

Capacities are in TB.

Mercury (login-hg.ncsa.teragrid.org) [User Docs]

  • NFS (raw: 18)
      • /scratch (formatted: 1). Cluster-wide scratch area. Use for testing and small jobs. Not a high-throughput parallel file system.
      • /home, /usr/local (formatted: 5.5). Cluster-wide access to libraries and programs. Testing.
  • GPFS (NSD) (raw: 50)
      • /gpfs_scratch1/ (formatted: 39). Accessible from login, GridFTP and all compute nodes. Recommended for I/O of less than 1TB. Good all-purpose parallel file system. Much better performance than NFS.
  • LUSTRE (Test) (capacity: 20)
      • Testbed for future LUSTRE implementations.

Tungsten (login-w.ncsa.teragrid.org) [File System Details] [User Docs]

  • NFS (raw: 15)
      • /nfs/scratch (formatted: 5.4). Cluster-wide scratch area. Use for testing and small jobs. Not a high-throughput parallel file system.
      • /u (formatted: 5.5). Cluster-wide access to home directories.
      • /usr/local, /usr/apps (formatted: 1). Cluster-wide access to libraries and programs.
  • LUSTRE (raw: 140)
      • /cfs/scratch/users/ (formatted: 59). Available for batch use and accessible from login nodes. Recommended file system for batch jobs. Striping (parallel access) to one file is disabled by default.
      • /cfs/projects/ widescratch/users/ (formatted: 49). Available for batch use and accessible from login nodes. Striping is enabled for parallel access to a single file.

Copper (login-cu.ncsa.teragrid.org) [User Docs]

  • GPFS (AIX) (raw: 50)
      • /scratch/users/, /scratch/batch/, /u (formatted: 35). Cluster-wide accessibility to GPFS under IBM's native AIX architecture.

Cobalt (login-co.ncsa.teragrid.org) [User Docs]

  • CXFS (raw: 370)
      • All mountpoints (formatted: 100). Cluster file system. Allows proprietary SGI linkage (high performance) as well as traditional NFS (high availability). Scalable bandwidth when accessed from multiple compute nodes.



General Parallel File System (GPFS)

General Parallel File System (GPFS) is a parallel file system package developed by IBM. It was originally developed for IBM's AIX operating system and later ported to Linux systems.

Features:

  • Appears to work just like a traditional UNIX file system from the user application level.
  • Provides additional functionality and enhanced performance when accessed via parallel interfaces such as MPI-I/O.
  • GPFS obtains high performance by striping data across multiple nodes and disks.
  • Striping is performed automatically at the block level. Therefore, every file larger than the designated block size is striped.
  • Can be deployed in NSD or SAN configurations.
  • Clusters hosting a GPFS file system can allow other clusters at different geographical locations to mount that file system.
  • Metadata is stored on all nodes.
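
The block-level striping described above can be illustrated with a minimal sketch. It assumes blocks are placed round-robin across disks, which is a simplification; the actual allocation policy of a given GPFS installation may differ. The function names, the 256 KiB block size, and the four-disk layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def stripe_blocks(file_size, block_size, num_disks, first_disk=0):
    """Map each block of a file to a disk index.

    Assumption: round-robin placement starting at first_disk.
    Returns {disk_index: [block numbers stored on that disk]}.
    """
    if block_size <= 0 or num_disks <= 0:
        raise ValueError("block_size and num_disks must be positive")
    num_blocks = -(-file_size // block_size)  # ceiling division
    layout = defaultdict(list)
    for block in range(num_blocks):
        layout[(first_disk + block) % num_disks].append(block)
    return dict(layout)


def disk_for_offset(offset, block_size, num_disks, first_disk=0):
    """Return the disk index that holds the byte at the given file offset."""
    return (first_disk + offset // block_size) % num_disks


if __name__ == "__main__":
    KiB = 1024
    # A 1 MiB file with 256 KiB blocks on 4 disks: one block per disk.
    print(stripe_blocks(1024 * KiB, 256 * KiB, 4))
    # A 100 KiB file fits in a single block, so it is not striped.
    print(stripe_blocks(100 * KiB, 256 * KiB, 4))
```

Under this model, a file no larger than one block occupies a single disk, which is why only files larger than the block size benefit from striping.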

Local Configuration and Availability:

  • GPFS is available on the Mercury TeraGrid cluster and on the IBM p-series (Copper).
  • GPFS is not global to all NCSA machines. The GPFS file systems on Mercury and Copper are each local to their own cluster.

Simple GPFS NSD Configuration

In an NSD configuration, a subset of nodes is designated as I/O nodes or "NSD servers". The ratio of I/O nodes to compute nodes can vary, depending on performance and redundancy requirements. The fact that I/O takes place remotely is transparent to applications running on the compute nodes.

 

Simple GPFS SAN Configuration

In a SAN configuration, all nodes have a direct connection to the disk arrays. This configuration permits very high I/O throughput to disk and scales very well as the number of processes increases.

 


Linux Cluster File System (LUSTRE)

LUSTRE, also known as the Linux Cluster File System, is an open-source distributed file system.

Features

  • Appears to work like a traditional UNIX file system (similar to GPFS).
  • Distributed Object Storage Targets (OSTs) are responsible for actual file-to-disk transactions.
  • A user-level library is available to allow application I/O requests to be translated into LUSTRE calls.
  • As with other parallel file systems, data striping from concurrently running nodes is the main performance enhancing factor.
  • Metadata is provided from a separate server.

Local Configuration and Availability

  • LUSTRE is deployed on Tungsten, NCSA's Linux ia32 cluster.
  • There are currently 104 OSTs serving the various "sub-clusters" that make up Tungsten (tuna, tunb, tunc, tund, tune).
  • Striping of large files across multiple OSTs can be enabled on a per-file or per-directory basis.
  • Good parallel performance can be obtained when multiple processes write individual files, or in striped mode, when accessing a single file concurrently.
  • More implementation details and the current status of LUSTRE on Tungsten can be found on the Tungsten File Systems Overview page.

SGI's Cluster CXFS File System

CXFS is SGI's latest shared file system based on XFS.

Features

  • As with GPFS and LUSTRE, CXFS appears to work exactly like a traditional UNIX file system.
  • Additional libraries are provided to invoke non-buffered direct I/O for very large, memory-intensive applications (see man intro-ffio for details).
  • Metadata is handled by a centralized server.

Local Configuration and Availability

  • CXFS is deployed on Cobalt, NCSA's Linux ia64 SGI Altix system.
  • All mounted file systems on Cobalt are on CXFS, including /home directories.
  • CXFS supports concurrent access to files from multiple compute nodes.
  • Both MPI and OpenMP applications can improve performance by parallelizing I/O on CXFS.

Tips on Using NCSA's Parallel File Systems

In general, performance will improve if I/O operations are performed in a directory that is mounted on a parallel file system. Changing to the appropriate scratch file system while running a job is the first step in improving I/O. For parallel applications, the goal is to have many nodes performing I/O operations concurrently, which improves performance up to the limit that the network and hardware configuration of the file system permits.

If an application spends a significant amount of time performing I/O, dividing the load among more nodes on the cluster will increase bandwidth to the file system. Consider writing one file per process rather than allowing the I/O to be serialized. If done carefully, the files can still be concatenated for portability to other systems. Routines can be written to support reading a set of files from a different number of processes than they were created with. MPI codes can take advantage of MPI-I/O routines that allow concurrent reading and writing to one file. These routines greatly improve performance over serial I/O on most of NCSA's file systems.
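
The one-file-per-process approach can be sketched as follows. This is a minimal illustration using only the Python standard library, not NCSA's recommended implementation: the file naming scheme (`output.NNNN`), the fixed record size, and all function names are assumptions. In a real job each process would call `write_part` with its own rank; here the ranks are simulated in a loop, and the readers work from the concatenated file rather than from the individual part files.

```python
import os
import tempfile


def part_path(directory, rank):
    """Hypothetical naming scheme: one file per writer rank."""
    return os.path.join(directory, f"output.{rank:04d}")


def write_part(directory, rank, data):
    """Each process writes only its own file, so no writes are serialized."""
    with open(part_path(directory, rank), "wb") as f:
        f.write(data)


def concatenate(directory, nprocs, combined_path):
    """Join the per-process files in rank order into one portable file."""
    with open(combined_path, "wb") as out:
        for rank in range(nprocs):
            with open(part_path(directory, rank), "rb") as part:
                out.write(part.read())


def read_share(combined_path, rank, nreaders, record_size):
    """Read this reader's contiguous share of fixed-size records.

    The reader count may differ from the number of writers.
    """
    total_records = os.path.getsize(combined_path) // record_size
    base, extra = divmod(total_records, nreaders)
    # The first `extra` readers take one additional record each.
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    with open(combined_path, "rb") as f:
        f.seek(start * record_size)
        return f.read(count * record_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        # Four simulated writers, each producing two 8-byte records.
        for rank in range(4):
            write_part(scratch, rank, bytes([rank]) * 16)
        combined = os.path.join(scratch, "output.all")
        concatenate(scratch, 4, combined)
        # Three readers split the eight records among themselves.
        print([len(read_share(combined, r, 3, 8)) // 8 for r in range(3)])
```

Fixed-size records keep the offset arithmetic simple. With variable-size records, a small index of record offsets would have to be written alongside the data.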

Parallel I/O performance is an active field of research. Ultimately, file system details and I/O optimization parameters will be transparently incorporated into most common scientific I/O libraries. Much work still needs to be done before this becomes a reality. Details such as buffer sizes and striping methods depend inherently on the underlying hardware configuration and are therefore platform dependent.
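
Because a good buffer size is platform dependent, one practical approach is to measure it on the target file system. The sketch below is an assumption-laden illustration rather than a rigorous benchmark: the function names, the candidate sizes, and the 8 MB test volume are hypothetical, and a single short run will be strongly affected by caching and system load.

```python
import os
import tempfile
import time


def time_write(path, total_bytes, buffer_size):
    """Write total_bytes to path in buffer_size chunks; return seconds taken."""
    chunk = b"\0" * buffer_size
    start = time.perf_counter()
    # buffering=0 disables Python's own buffer: one write call per chunk.
    with open(path, "wb", buffering=0) as f:
        written = 0
        while written < total_bytes:
            n = min(buffer_size, total_bytes - written)
            written += f.write(chunk[:n])
        os.fsync(f.fileno())  # include the time to flush data to storage
    return time.perf_counter() - start


def sweep_buffer_sizes(directory, total_bytes, buffer_sizes):
    """Return {buffer_size: throughput in MB/s} for each candidate size."""
    results = {}
    for size in buffer_sizes:
        path = os.path.join(directory, f"bufsweep.{size}")
        elapsed = time_write(path, total_bytes, size)
        os.remove(path)
        results[size] = total_bytes / max(elapsed, 1e-9) / 1e6
    return results


if __name__ == "__main__":
    # Replace the temporary directory with a directory on the file
    # system being evaluated, for example a scratch area.
    with tempfile.TemporaryDirectory() as target:
        sizes = [4 * 1024, 64 * 1024, 1024 * 1024]
        for size, rate in sweep_buffer_sizes(target, 8 * 1024 * 1024, sizes).items():
            print(f"{size:>8} bytes: {rate:.1f} MB/s")
```

Results from one platform should not be assumed to carry over to another; repeat the measurement on each file system of interest.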