Table of Contents
- Data Transfer Overview
- Data Transfer Protocols
- Data Transfer Clients
- NCSA-TeraGrid Data Transfer Resources
- NCSA Mass Storage System Transfers
- Data Transfer Software Installation
- Transfer Performance Considerations
- Data Transfer Examples
Data Transfer Overview
The material in this section on data transfer has been gathered and
organized to provide useful and practical information that will help
you move your data from one storage resource to another. Here you will
find information about data transfer protocols and the clients that are
installed on NCSA resources. You will also learn about transferring data
to and from TeraGrid, to and from NCSA's Mass Storage System (MSS), and
to and from external, remote locations. We also provide software recommendations,
guidance for installing the various software, transfer performance considerations,
and information about third-party software. Additionally, this documentation
is supported by a searchable database of working command-line data transfer
examples that illustrate the transfer methods available to support your
computational work.
Data Transfer to Facilitate Computation
The reason for computing at a site such as NCSA is traditionally rooted
in a need for computational power. The data requirements of computationally
intensive projects vary. In the most distilled form, a computer program
would require only one bit of storage (for output) to hold
a yes/no answer. In practice, however, large data sets often accompany
large computations. Accommodating and managing these data sets often
requires advanced tools that leverage the less glamorous system
capabilities, those not associated with computational cycles, that
make practical large-scale computation possible.
Storage locations may include:
- High-performance parallel file system, often referred to as a scratch file
system
- Local disk drive of a commodity computer or workstation
- Database, which implies underlying file system storage, with SQL as
the interface to the binary file format (the performance characteristics
of the file system depend, of course, on the underlying hardware)
- Long-term archival storage system such as the NCSA Mass Storage
System (MSS); these systems often employ a tape archive and disk cache
retrieval mechanism. Once data is staged to disk, it is then
available for transfer.
The figure below depicts a simple "road map" overview of common transfer
scenarios within NCSA and to or from TeraGrid.

Figure. A "road map" view of common data
transfer routes.
Transferring all of this data takes forethought and planning.
How data is moved, and what transfer rates you can expect, depends
on how it is stored and how it will be used. To facilitate large
computational runs, data may need to be moved in the following ways:
- Move data from a production run off of a file system by:
  - Transferring it to NCSA's MSS
  - Transferring it from an NCSA cluster to an offsite location
  - Transferring it between NCSA clusters
- Stage data onto a locally accessible storage area for a job by moving
input data onto a scratch parallel file system prior to job execution.
Depending on the amount of data and where it is stored, this could be
a time-consuming endeavor. Ensuring that the job runs before its input
data are purged may require a special project allocation (File
System Allocation Request) on the scratch file system.
- Transfer data to or from NCSA and remote sites by following the guidelines
  in these scenarios:
  - Cluster to Cluster: Generally speaking, this type of transfer
    has the best performance potential. High-performance parallel file
    systems and multi-host transfer servers (GridFTP), coupled with dedicated
    Wide Area Network (WAN) connections between sites, offer the best
    transfer performance.
  - Cluster to/from Archive
    - Current archive implementations still require a tape staging
      time that must be accounted for.
    - Staging large amounts of input data in a large parallel batch
      job is not recommended.
  - Cluster to/from Workstation: Network and firewall issues
    (external to NCSA) may pose problems for these types of transfers.
  - Archive to/from Workstation
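The planning described above comes down to simple arithmetic: data size
divided by sustained rate, plus any tape staging delay. The following is a
minimal sketch, not an NCSA tool. The function name transfer_hours, the data
size, the sustained rates, and the 30-minute tape staging delay are all
hypothetical values chosen for illustration; measure real rates on your own
systems before relying on an estimate.

```python
def transfer_hours(size_gb, rate_mb_per_s, tape_staging_min=0.0):
    """Estimate wall-clock hours to move size_gb at a sustained rate.

    Uses decimal units (1 GB = 1000 MB). tape_staging_min is the extra
    delay before archived data is staged to disk and ready to transfer.
    """
    if rate_mb_per_s <= 0:
        raise ValueError("rate_mb_per_s must be positive")
    transfer_seconds = size_gb * 1000.0 / rate_mb_per_s
    return (transfer_seconds + tape_staging_min * 60.0) / 3600.0


# Hypothetical scenario: 500 GB of input data for one job.
cluster_to_cluster = transfer_hours(500, 200)                  # assumed 200 MB/s
from_archive = transfer_hours(500, 50, tape_staging_min=30)    # assumed 50 MB/s

print(f"Cluster to cluster: {cluster_to_cluster:.2f} h")
print(f"Archive to cluster: {from_archive:.2f} h")
```

Comparing an estimate like this against the purge window of the scratch file
system indicates whether a special allocation might be needed.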
Performance Note
The ultimate throughput of any file transfer is limited by the weakest
link in the chain. Identifying the bottleneck can sometimes be the most
difficult exercise. If the endpoint of interest is a location external
to TeraGrid, working with your site administrator and network technicians
may be the only way to overcome performance limitations. For this reason,
the examples in this document are primarily focused on transfer scenarios
that involve an endpoint within either NCSA or TeraGrid. Similar techniques
can be applied to external endpoints, but no amount of effort (installing
software, servers, etc.) can overcome a single network bottleneck or
confounding firewall rule.
Check with the network administrator of your local site for connectivity
details and for possible firewall or network bottlenecks that can lead
to unexpected or inconsistent network bandwidth or functionality. Transfers
can only take place as fast as the slowest component in the network chain.
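The "weakest link" rule can be written down directly: the best achievable
end-to-end rate is the minimum of the rates of the components along the path.
The sketch below assumes you have already measured or been told a sustained
rate for each component. The component names and rates are hypothetical and
do not describe any real NCSA or TeraGrid resource.

```python
def bottleneck(component_rates_mb_per_s):
    """Return (name, rate) of the slowest component in a transfer path."""
    if not component_rates_mb_per_s:
        raise ValueError("at least one component is required")
    name = min(component_rates_mb_per_s, key=component_rates_mb_per_s.get)
    return name, component_rates_mb_per_s[name]


# Hypothetical sustained rates (MB/s) for each hop between two endpoints.
path = {
    "source file system": 800.0,
    "site WAN link": 1250.0,
    "campus firewall": 40.0,
    "workstation disk": 100.0,
}

name, rate = bottleneck(path)
print(f"Bottleneck: {name} at {rate} MB/s")
```

In this made-up path, upgrading any component other than the slowest one
would not change the end-to-end rate, which is why identifying the bottleneck
comes first.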