
Timing and Profiling on Abe, an Overview

  1. Timing
  2. Profiling

1. Timing codes on the Abe Linux Cluster

There are many ways to get timing information for a code. The overall execution of a code can be timed with tools provided by the operating system (time and gprof) or by performance tools (Perfsuite's psrun, for example). Several methods also time functions, subroutines or sections of code without requiring you to instrument the code (gprof and Perfsuite's psrun); this is usually called profiling the code. Finally, you can instrument your code with timers provided by the operating system (system calls) or by performance tools like Perfsuite or PAPI. Performance tools provide the additional ability to profile the code with a metric other than ticks or time, FLOPS for example.

See NCSA's Performance Tools page for more information on PAPI and other packages.

There are three issues that need to be considered when implementing timing: implementation, performance and portability. Typically the simplest and fastest way to get timing information also provides the coarsest information, although increasingly sophisticated performance tools like Perfsuite provide insightful detail with little or no additional work. The most time-consuming but most controllable approach is to instrument the code by hand. Most methods have little impact on the overall performance of a code, but testing with and without timing or instrumentation is recommended. If you use several computational facilities, the only things common to most if not all platforms are the operating system timing routines.

As clock speeds increase beyond the gigahertz range, timers should likewise be able to offer nanosecond resolution. The following table shows approximate performance data on resolution and overhead for various timer functions. Most performance tools like Perfsuite and PAPI use the native timers for the highest resolution timing. System-provided tools like gprof and time have software governors that limit resolution.

To summarize the data in the table: for portability, use the gettimeofday function for wall clock time. For better resolution, use the native ASM timer. Fortran programmers who wish to use a Fortran routine should use SYSTEM_CLOCK with code modifications that take advantage of vendor extensions to the function.


Table. Summary of timing routine resolution and overhead [1].

routine            source             type       approximate      approximate
                                                 resolution       overhead
                                                 (microseconds)   (microseconds)

linux asm          Linux OS           wall       0.05             0.05
gettimeofday       OS                 wall       1                1
clock_gettime [2]  OS                 wall       0.1              0.1
SYSTEM_CLOCK [3]   Intel Fortran      wall       1                0.25
times [4]          OS                 user, sys  10000            0.05
clock [4]          OS                 cpu        10000            0.3
CPU_TIME           GNU/Intel Fortran  cpu        1000             0.4
getrusage [4]      OS                 user, sys  1000             0.4

[1] Values determined from runs on 2.33 GHz Intel64 processor running Red Hat Enterprise Linux AS release 4 with the Intel 10.1 compiler.
[2] Now provides better resolution and overhead but use gettimeofday() for portability.
[3] Resolution depends on size of integer type. Use 8 byte integer for better resolution but this is an Intel extension to the function and will break portability.
[4] Resolution determined by the definition of HZ in param.h (/usr/include/asm-arch/).

For Fortran users of etime and dtime, the resolution and overhead are the same as for getrusage.
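The resolution and overhead figures in the table can be estimated on your own system. The following is a minimal sketch for gettimeofday, not the program used to produce the table. The function names wall_seconds, timer_overhead_usec and timer_resolution_usec are hypothetical, and both arguments are assumed to be positive.

```c
#include <stddef.h>     /* NULL */
#include <sys/time.h>   /* gettimeofday, struct timeval */

/* Wall clock time in seconds since the epoch. */
static double wall_seconds(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + 1.e-6 * (double)tp.tv_usec;
}

/* Approximate overhead: average cost of one timer call, in microseconds,
   estimated from ncalls back-to-back calls (ncalls assumed > 0). */
double timer_overhead_usec(long ncalls)
{
    long i;
    double t0, t1;

    t0 = wall_seconds();
    for (i = 0; i < ncalls; i++)
        (void)wall_seconds();
    t1 = wall_seconds();
    return 1.e6 * (t1 - t0) / (double)ncalls;
}

/* Approximate resolution: smallest positive step seen between successive
   readings, in microseconds, over nsamples steps (nsamples assumed > 0). */
double timer_resolution_usec(int nsamples)
{
    int i;
    double best = 1.e30;

    for (i = 0; i < nsamples; i++) {
        double t0 = wall_seconds();
        double t1;
        do {
            t1 = wall_seconds();   /* spin until the clock advances */
        } while (t1 <= t0);
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return 1.e6 * best;
}
```

Calling timer_overhead_usec(1000000) and timer_resolution_usec(1000) gives rough estimates only. Because the time is held in a double, steps much smaller than a microsecond cannot be resolved by this sketch.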


In case you are wondering about the definition of the types of time, here they are:
  • user -- the amount of CPU time used by the user's program
  • sys (or system) -- the amount of CPU time used by the system in support of the user's program
  • cpu -- the total CPU time, i.e., user + sys
  • wall -- the wall clock time, i.e., elapsed real time
Typically the cpu time and the wall clock time are nearly the same, unless other user processes are running or there is significant system usage, such as heavy disk activity from I/O operations or swapping/paging. On the NCSA Linux clusters, each node is allocated to only one user at a time, independent of the number of processors per node (ppn=1 or ppn=2).
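To make the four kinds of time concrete, the sketch below reads user and sys time with getrusage and wall time with gettimeofday, then forms cpu time as user + sys. The type times_snapshot and the functions take_snapshot, cpu_seconds and wall_elapsed are hypothetical names introduced here for illustration.

```c
#include <stddef.h>         /* NULL */
#include <sys/time.h>       /* gettimeofday, struct timeval */
#include <sys/resource.h>   /* getrusage, struct rusage, RUSAGE_SELF */

typedef struct {
    double user;   /* CPU seconds spent in the user's program */
    double sys;    /* CPU seconds spent by the system for the program */
    double wall;   /* wall clock seconds since the epoch */
} times_snapshot;

/* Convert a timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + 1.e-6 * (double)tv.tv_usec;
}

/* Record user, sys and wall time for the calling process. */
times_snapshot take_snapshot(void)
{
    times_snapshot s;
    struct rusage ru;
    struct timeval now;

    getrusage(RUSAGE_SELF, &ru);
    gettimeofday(&now, NULL);
    s.user = tv_seconds(ru.ru_utime);
    s.sys  = tv_seconds(ru.ru_stime);
    s.wall = tv_seconds(now);
    return s;
}

/* cpu time between two snapshots: user + sys */
double cpu_seconds(times_snapshot a, times_snapshot b)
{
    return (b.user - a.user) + (b.sys - a.sys);
}

/* wall clock time between two snapshots */
double wall_elapsed(times_snapshot a, times_snapshot b)
{
    return b.wall - a.wall;
}
```

Take one snapshot before and one after the work to be timed. A wall_elapsed value much larger than cpu_seconds suggests that the process was waiting, for example on I/O or paging.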

Detailed information on each method follows.


1.1 time (/usr/bin/time)

The quickest way to time a code is to run it under the command /usr/bin/time. The command reports user time, system time and total wall time. See the man page on time for more information on the command, especially on formatting the output. Note that the csh and tcsh shells have a built-in command that is also called time.

% /usr/bin/time a.out

Use the -p option for the portable output format.

1.2 gprof

A quick way to get more detailed information on functions and routines is to use the profiling tool gprof. The first step is to compile the source code with the compiler flags for profiling. For the Intel compiler the flags are -p -g and for the GNU compiler the flag is -pg. For the Intel compiler the -g flag does not change the optimization selected by a -O flag. The second step is to execute the code, which generates a gmon.out file. To analyze the gmon.out file, use gprof. The results of the analysis are written to stdout. The flat profile contains a useful breakdown of time spent in functions and subroutines. The call graph profile contains inclusive and exclusive time spent in subroutines and functions. See the man pages on the Intel and GNU compilers for information about the compiler flags for profiling and see the man page on gprof for its options.

% ifort -O -p -g foo.f # or gcc -O -pg foo.c
% ./a.out
% gprof --flat-profile a.out gmon.out

See the section on Profiling below for more information about using gprof.

Note: The Intel compiler inlines aggressively, so for simple codes you may get the following warning from gprof:
gprof: gmon.out file is missing call-graph data
In this case, use the compiler flag -inline-level=0, which reduces optimization but prevents the compiler from flattening your code.

For even easier timing and profiling without re-compiling, consider using psrun from Perfsuite.

1.3 gettimeofday

When instrumenting a code with timing calls, use the routine gettimeofday if portability is a primary concern. This routine provides wall clock time. As usual, see the man pages for particulars on usage. Both its resolution and its overhead are about one microsecond. It can be used as an elapsed timer, as shown in the following C code fragment.

#include <stddef.h>     /* definition of NULL */
#include <sys/time.h>  /* definition of timeval struct and prototype of gettimeofday */

    double t1,t2,elapsed;
    struct timeval tp;
    int rtn;

    ....
    ....
    rtn=gettimeofday(&tp, NULL);
    t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    ....
    /* do some work */
    ....
    rtn=gettimeofday(&tp, NULL);
    t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    elapsed=t2-t1;

You can also make a C function that can be called from Fortran:

#include <stddef.h> /* defines NULL */
#include <sys/time.h>

double second_()   /* compilers like AIX xlf do not require the trailing '_' */
{
    struct timeval tp;
    int rtn;
    rtn=gettimeofday(&tp, NULL);

    return ((double)tp.tv_sec+(1.e-6)*tp.tv_usec);
}
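The table above also lists clock_gettime, which reports nanoseconds rather than microseconds. As a sketch of the same pattern with that routine, the function below reads the POSIX CLOCK_MONOTONIC clock, which is not affected by adjustments to the system date. The name second_monotonic is hypothetical, and on older glibc versions linking with -lrt may be required.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime under strict -std=c99 */
#endif
#include <time.h>   /* clock_gettime, struct timespec, CLOCK_MONOTONIC */

/* Seconds since an arbitrary fixed starting point, or -1.0 on error.
   Only differences between two return values are meaningful. */
double second_monotonic(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + 1.e-9 * (double)ts.tv_nsec;
}
```

As with second_(), the difference between two successive calls gives the elapsed wall clock time in seconds.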

1.4 SYSTEM_CLOCK

To use the Fortran SYSTEM_CLOCK() subroutine with the Intel extension of 8 byte variables, the following example can be used.

  double precision function second()
  integer(8) C,R,M
    CALL SYSTEM_CLOCK (COUNT=C, COUNT_RATE=R, COUNT_MAX=M)
    second = dble(C)/dble(R)
    return
  end

The difference between successive calls to second() gives the elapsed time in seconds. For more information, please see the Intel reference page for SYSTEM_CLOCK.

1.5 Linux Assembly code

When instrumenting a code with timing calls, use the native Linux ASM timer if high resolution is more important than portability. This routine provides wall clock time. You can use either the Intel or GNU compiler.

unsigned long long int cycles_x86_64(void)
{
  unsigned long long int val;
  do {
     unsigned int a,d;
     /* rdtsc returns the low 32 bits of the counter in eax and the high 32 bits in edx */
     asm volatile("rdtsc" : "=a" (a), "=d" (d));
     val = ((unsigned long long)a) | (((unsigned long long)d)<<32);
  } while(0);
  return(val);
}

You can link to the resulting object file with either Intel or GNU compiler from C or Fortran with the appropriate wrapper if needed. Below is an example of using the above routine as a timer.

#define CPS 2327505000   /* counter ticks per second for this processor; no trailing ';' in a macro */
static double iCPS;
static unsigned start=0;

double second(void) /* Include an '_' if you will be calling from Fortran */
{
  double foo;
  if (!start)
  {
     iCPS=1.0/(double)CPS;
     start=1;
  }
  foo=iCPS*cycles_x86_64();
  return(foo);
}
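The CPS value above is hard-coded for one processor. One possible alternative, sketched here under stated assumptions, is to measure the tick rate at run time by comparing the cycle counter against gettimeofday over a short interval. The names read_cycles and calibrate_cps are hypothetical. The result is meaningful only if the counter ticks at a constant rate and the process stays on one processor during the measurement. On non-x86 systems the sketch falls back to microsecond counts so that it still compiles.

```c
#include <stddef.h>     /* NULL */
#include <sys/time.h>   /* gettimeofday, struct timeval */

#if defined(__x86_64__) || defined(__i386__)
/* Read the time stamp counter, as in cycles_x86_64() above. */
static unsigned long long read_cycles(void)
{
    unsigned int a, d;
    __asm__ __volatile__("rdtsc" : "=a" (a), "=d" (d));
    return (unsigned long long)a | ((unsigned long long)d << 32);
}
#else
/* Fallback (assumption for non-x86 builds): count microseconds instead. */
static unsigned long long read_cycles(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (unsigned long long)tp.tv_sec * 1000000ULL
         + (unsigned long long)tp.tv_usec;
}
#endif

/* Estimate counter ticks per second by spinning for at least
   interval_sec seconds of wall clock time (interval_sec assumed > 0). */
double calibrate_cps(double interval_sec)
{
    struct timeval t0, t1;
    unsigned long long c0, c1;
    double dt;

    gettimeofday(&t0, NULL);
    c0 = read_cycles();
    do {
        gettimeofday(&t1, NULL);
        dt = (double)(t1.tv_sec - t0.tv_sec)
           + 1.e-6 * (double)(t1.tv_usec - t0.tv_usec);
    } while (dt < interval_sec);
    c1 = read_cycles();

    return (double)(c1 - c0) / dt;
}
```

A call such as calibrate_cps(0.1) made once at start-up could replace the CPS constant in second(). Longer intervals reduce the relative error contributed by the microsecond resolution of gettimeofday.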

2. Profiling

2.1 gprof

A quick way to get more detailed information on functions and routines is to use the profiling tool gprof. The first step is to compile the source code with the compiler flags for profiling. For the Intel compiler the flags are -p -g and for the GNU compiler the flag is -pg. For the Intel compiler the '-g' flag does not change the optimization selected by the '-O' flag, if present. The second step is to execute the code, which generates a gmon.out file. To analyze the gmon.out file, use gprof. The results of the analysis are written to stdout.

% ifort -O -p -g foo.f # or gcc -O -pg foo.c
% ./a.out
% gprof a.out gmon.out

The 'flat' profile will contain a useful breakdown of time spent in functions and subroutines. The 'call graph' profile contains inclusive and exclusive time spent in subroutines and functions. See the man pages on the Intel and GNU compilers for information about the compiler flags for profiling and see the man page on gprof for its options.

GMON_OUT_PREFIX is an undocumented environment variable that controls the name of the gmon output file. When it is set and a threaded or MPI code is profiled, each process generates a gmon file called $GMON_OUT_PREFIX.pid. Each gmon file can then be analyzed separately, or gprof can sum them into a single gmon.sum file that is examined as a whole:

% gprof -s foo $GMON_OUT_PREFIX.*
% gprof foo gmon.sum

Note: The Intel compiler inlines aggressively, so for simple codes you may get the following warning from gprof:
gprof: gmon.out file is missing call-graph data
In this case, use the compiler flag -inline-level=0, which reduces optimization but prevents the compiler from flattening your code.

2.2 Perfsuite

The Perfsuite performance suite provides a profiling tool called psrun. It is available on NCSA Linux clusters in /usr/apps/tools/perfsuite and extends the functionality of the timing and profiling tools mentioned above.

The simplest way to use psrun is with an existing executable:

% setenv PS_HWPC_TIME 0 # collect xml stats file for any successful exit
% psrun ./foo
% psprocess foo*.xml

See the documentation for psprocess for information on analyzing the XML files generated by psrun.

Performance Engineering and Computational Methods Group (PECM)
High End Computing Division