Data Transfer

Table of Contents

  1. Data Transfer Overview
    Data Transfer Protocols
    Data Transfer Clients
    NCSA-TeraGrid Data Transfer Resources
    NCSA Mass Storage System Transfers
    Data Transfer Software Installation
    Transfer Performance Considerations
    Data Transfer Examples

Data Transfer Overview

This section gathers practical information to help you move data from one storage resource to another. It covers data transfer protocols and the clients installed on NCSA resources; transfers to and from TeraGrid, NCSA's Mass Storage System (MSS), and external, remote locations; software recommendations and installation guidance; transfer performance considerations; and third-party software. The documentation is also supported by a searchable database of working command-line data transfer examples that illustrate the transfer methods available to support your computational requirements.

Data Transfer to Facilitate Computation

The reason for computing at a site such as NCSA is traditionally rooted in a need for computational power. The data requirements of computationally intensive projects vary. In the most distilled form, a computer program would require only one bit of storage (for output) to accommodate a yes/no answer. In practice, however, large data sets often accompany large computations. Accommodating and managing these data sets often requires advanced tools that leverage the not-so-glamorous system capabilities, those not associated with computational cycles, that make large-scale computation practical.

Storage locations may include:

  • High-performance parallel file system, often referred to as a scratch file system
  • Local disk drive of a commodity computer or workstation
  • Database, which ultimately stores its data in files on a file system, with SQL serving as the interface to the binary file format (the performance characteristics of that file system depend, of course, on the underlying hardware)
  • Long-term archival storage system such as the NCSA Mass Storage System (MSS); these systems often employ a tape archive and disk cache retrieval mechanism. Once data is staged to disk, it is then available for transfer.

The figure below depicts a simple "road map" overview of common transfer scenarios within NCSA and to or from TeraGrid.

Figure. A "road map" view of common data transfer routes.

Transferring all of this data takes some forethought and planning. How data is moved, and the transmission rates you can expect, depend on how the data is stored and how it will be used. To facilitate large computational runs, data may need to be moved in the following ways:

  • Move data from a production run off of a file system by:
    • Transferring it to NCSA's MSS
    • Transferring it from an NCSA cluster to an offsite location
    • Transferring it between NCSA clusters
  • Stage data onto a locally accessible storage area for a job by moving input data onto a scratch parallel file system prior to job execution. Depending on the amount of data and where it is stored, this could be a time-consuming endeavor. Ensuring that the job executes before its input data is purged may require a special project allocation (File System Allocation Request) on the scratch file system.
  • Transfer data to or from NCSA and remote sites by following the guidelines in these scenarios:
    • Cluster to Cluster: Generally speaking, this type of transfer has the best performance potential. High-performance parallel file systems and multi-host transfer servers (GridFTP), coupled with dedicated Wide Area Network (WAN) connections between sites, offer the best transfer performance.
    • Cluster to/from Archive
      • Current archive implementations still require a tape staging time that must be accounted for.
      • Staging large amounts of input data in a large parallel batch job is not recommended.
    • Cluster to/from Workstation: Network and firewall issues (external to NCSA) may pose problems for these types of transfers.
    • Archive to/from Workstation
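
When planning a staging step, a rough time estimate can help decide how far ahead of job execution the transfer should start. The sketch below is only an illustration: the function name `estimate_transfer_seconds`, the data size, the transfer rates, and the tape staging delay are all hypothetical values chosen for the example, not measurements of any NCSA or TeraGrid resource.

```python
def estimate_transfer_seconds(size_bytes, rate_bytes_per_s, tape_stage_seconds=0.0):
    """Rough wall-clock estimate: optional tape staging delay plus time on the wire."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return tape_stage_seconds + size_bytes / rate_bytes_per_s


# Hypothetical numbers, for illustration only.
GB = 10**9
MB = 10**6
input_size = 500 * GB

# Cluster to cluster: no tape staging, assumed sustained rate of 200 MB/s.
cluster_to_cluster = estimate_transfer_seconds(input_size, 200 * MB)

# Archive to cluster: assumed 30-minute tape staging delay and 100 MB/s.
archive_to_cluster = estimate_transfer_seconds(
    input_size, 100 * MB, tape_stage_seconds=30 * 60
)

print(f"cluster to cluster: {cluster_to_cluster / 3600:.2f} hours")
print(f"archive to cluster: {archive_to_cluster / 3600:.2f} hours")
```

An estimate like this ignores contention, retries, and variation in tape staging time, so treat the result as a lower bound when comparing it against a scratch file system purge window.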

Performance Note

The ultimate throughput of any file transfer is limited by the weakest link in the chain. Identifying the bottleneck can sometimes be the most difficult exercise. If the endpoint of interest is a location external to TeraGrid, working with your site administrator and network technicians may be the only way to overcome performance limitations. For this reason, the examples in this document are primarily focused on transfer scenarios that involve an endpoint either within NCSA or TeraGrid. Similar techniques can be applied to external endpoints, but no amount of effort (installing software, servers, etc.) can overcome a single network bottleneck or confounding firewall rule.

Check with the network administrator at your local site for connectivity details and for possible firewall or network bottlenecks that can lead to unexpected or inconsistent network bandwidth or functionality. Transfers can take place only as fast as the slowest component in the network chain.
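
The "slowest component" rule can be written down directly: the achievable rate of a transfer path is the minimum of the rates of its components. The following is a minimal sketch, assuming you have measured or estimated a sustained rate for each stage of the path. The helper name `bottleneck`, the component names, and the rates are hypothetical and serve only to illustrate the idea.

```python
def bottleneck(components):
    """Return (name, rate) of the slowest component in a transfer path."""
    if not components:
        raise ValueError("at least one component is required")
    # The component with the smallest rate caps the whole transfer.
    name = min(components, key=components.get)
    return name, components[name]


# Hypothetical sustained rates in MB/s for each stage of one transfer path.
path = {
    "source file system": 800.0,
    "source network interface": 120.0,
    "wide area network": 1000.0,
    "campus firewall": 40.0,
    "destination disk": 90.0,
}

name, rate = bottleneck(path)
print(f"limiting component: {name} ({rate} MB/s)")
```

In this made-up path, upgrading the wide area network or the source file system would not change the end-to-end rate; only addressing the limiting component would.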