- Timing
- Profiling
1. Timing codes on the Abe Linux Cluster
There are many ways to get timing information for a code. The overall
execution of a code can be timed with tools provided by the operating
system (time and gprof) and with performance tools (Perfsuite's psrun,
for example). Functions, subroutines or sections of code can also be
timed without instrumenting your code (gprof and Perfsuite's psrun).
This is usually called profiling the code.
Finally, there are several ways to instrument your code with timers
provided by the operating system (system calls) or by performance tools
like Perfsuite or PAPI.
Performance tools provide the additional ability to profile the code
with a metric other than ticks or time, FLOPS for example.
See NCSA's Performance Tools page for more information on PAPI and
other packages.
There are three issues that need to be considered when implementing
timing: implementation, performance and portability.
Typically the simplest and fastest way to get timing information also
provides the coarsest information, although increasingly sophisticated
performance tools like Perfsuite provide insightful detail with little
or no additional work. Instrumenting the code by hand takes the most
time but gives the most control.
Most methods have little impact on the overall performance of a code,
but testing with and without timing or instrumentation is recommended.
If you use several computational facilities, the only things common to
most if not all platforms are the operating system timing routines.
As clock speeds increase beyond a gigahertz, timers should likewise
offer nanosecond resolution. The following table shows approximate
performance data on resolution and overhead for various timer
functions. Most performance tools like Perfsuite and PAPI use the
native timers for the highest resolution timing. System provided tools
like gprof and time have software governors that limit resolution.
To summarize the data in the table: for portability, use the
gettimeofday function for wall clock time. For better resolution, use
the native ASM timer. Fortran programmers who wish to use a Fortran
routine can use SYSTEM_CLOCK with code modifications that take
advantage of vendor extensions to the function.
Table. Summary of timing routine resolution and overhead [1].
| routine | source | type | approximate resolution (microseconds) | approximate overhead (microseconds) |
| linux asm | Linux OS | wall | 0.05 | 0.05 |
| gettimeofday | OS | wall | 1 | 1 |
| clock_gettime [2] | OS | wall | 0.1 | 0.1 |
| SYSTEM_CLOCK [3] | Intel Fortran | wall | 1 | 0.25 |
| times [4] | OS | user, sys | 10000 | 0.05 |
| clock [4] | OS | cpu | 10000 | 0.3 |
| CPU_TIME | GNU/Intel Fortran | cpu | 1000 | 0.4 |
| getrusage [4] | OS | user, sys | 1000 | 0.4 |
1. Values determined from runs on 2.33 GHz Intel64 processor
running Red Hat Enterprise Linux AS release 4 with the Intel 10.1 compiler.
2. Now provides better resolution
and overhead but use gettimeofday() for portability.
3. Resolution depends on size of integer type.
Use 8 byte integer for better resolution but this is an Intel extension
to the function and will break portability.
4. Resolution determined by the definition of HZ in
param.h (/usr/include/asm-arch/).
For Fortran users who use etime and dtime,
the resolution and overhead are the same as for getrusage.
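The resolution and overhead figures in the table can be estimated for any timer with a short test. The following is a minimal sketch for gettimeofday, not the method used to produce the table: the helper names timer_overhead_usec and timer_resolution_usec are hypothetical, overhead is taken as the average cost of many back-to-back calls, and resolution as the smallest nonzero step observed between successive readings.

```c
#include <stddef.h>   /* NULL */
#include <sys/time.h> /* gettimeofday, struct timeval */

/* Wall clock time in seconds, read with gettimeofday. */
static double wall_seconds(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + 1.e-6 * tp.tv_usec;
}

/* Hypothetical helper: average cost of one timer call, in microseconds.
   ncalls back-to-back calls are timed and the total is divided by ncalls. */
double timer_overhead_usec(long ncalls)
{
    double t1, t2;
    long i;
    volatile double sink = 0.0; /* keeps the calls from being optimized away */

    t1 = wall_seconds();
    for (i = 0; i < ncalls; i++)
        sink += wall_seconds();
    t2 = wall_seconds();
    return 1.e6 * (t2 - t1) / (double)ncalls;
}

/* Hypothetical helper: smallest nonzero step seen between successive
   readings, in microseconds. Returns a negative value if the clock
   never advanced within maxtries readings. */
double timer_resolution_usec(long maxtries)
{
    double prev = wall_seconds();
    double now, step, best = -1.0;
    long i;

    for (i = 0; i < maxtries; i++) {
        now = wall_seconds();
        step = now - prev;
        if (step > 0.0 && (best < 0.0 || step < best))
            best = step;
        prev = now;
    }
    return 1.e6 * best;
}
```

Calling timer_overhead_usec(1000000) and timer_resolution_usec(1000000) from a small driver gives numbers that can be compared with the gettimeofday row of the table. Results vary with hardware, kernel and system load, so several runs are advisable.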
In case you are wondering about the definition of
the
types of time, here they are:
- user -- the amount of CPU time used by
the user's program
- sys (or system) -- the amount
of CPU time used by the system in support of the user's program
- cpu -- the total CPU time, i.e., user
+ sys
- wall -- the wall clock time, i.e.,
elapsed real time
Typically the cpu time and the wall clock time are the same, unless
there are other user processes running or there is significant system
usage, such as heavy disk activity from I/O operations or
swapping/paging. On the NCSA Linux clusters, each node is allocated to
only one user at a time, independent of the number of processors per
node (ppn=1 or ppn=2).
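The four kinds of time can be read from inside a program. The sketch below is one possible arrangement, assuming getrusage for user and sys time and gettimeofday for wall time. The struct time_report type and the functions read_times, elapsed_times and burn_cpu are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>       /* NULL */
#include <sys/time.h>     /* gettimeofday, struct timeval */
#include <sys/resource.h> /* getrusage, struct rusage */

/* Hypothetical container for the four kinds of time, in seconds. */
struct time_report {
    double user, sys, cpu, wall;
};

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + 1.e-6 * tv.tv_usec;
}

/* Read the current values. user and sys are totals for this process
   since it started; wall is the time of day. */
void read_times(struct time_report *r)
{
    struct rusage ru;
    struct timeval tp;

    getrusage(RUSAGE_SELF, &ru);
    gettimeofday(&tp, NULL);
    r->user = tv_seconds(ru.ru_utime);
    r->sys  = tv_seconds(ru.ru_stime);
    r->cpu  = r->user + r->sys;   /* cpu = user + sys */
    r->wall = tv_seconds(tp);
}

/* Time spent between two readings. */
struct time_report elapsed_times(const struct time_report *start,
                                 const struct time_report *end)
{
    struct time_report d;

    d.user = end->user - start->user;
    d.sys  = end->sys  - start->sys;
    d.cpu  = d.user + d.sys;
    d.wall = end->wall - start->wall;
    return d;
}

/* A CPU-bound loop that exists only to give the timers something to measure. */
double burn_cpu(long n)
{
    volatile double x = 0.0;
    long i;

    for (i = 0; i < n; i++)
        x += 1.0 / (double)(i + 1);
    return x;
}
```

Taking a reading before and after burn_cpu and passing both to elapsed_times should show cpu close to wall on an otherwise idle node. A large gap between the two suggests competing processes, I/O waits or paging.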
The following sections give detailed information on each method.
1.1 time (/usr/bin/time)
The quickest way to time a code is to run it under the command
/usr/bin/time. The command reports user time, system time and total
wall time. See the man page on time for more information on the
command, especially on formatting the output.
Note that the csh and tcsh shells have a built-in command also called
time.
% /usr/bin/time a.out
Use the -p option for the portable (POSIX) output format.
1.2 gprof
A quick way to get more detailed information on functions and routines
is to use the profile tool gprof.
The first step is to compile the source code with the compiler flags
for profiling. For the Intel compiler the flags are -p -g and
for the GNU compiler the flag is -pg. For the Intel
compiler the -g flag does not change the optimization
indicated by the presence of a -O flag. After compiling
the code, the second step is to execute the code, which
generates a gmon.out file. To analyze the gmon.out file, use
gprof. The results of the analysis are written to
stdout. The flat profile contains a useful breakdown of time
spent in functions and subroutines. The call graph profile
contains inclusive and exclusive time spent in subroutines and
functions. See the man pages on the Intel and GNU compilers for
information about the compiler flags for profiling and see the man
page on gprof for its options.
% ifort -O -p -g foo.f # or gcc -O -pg foo.c
% ./a.out
% gprof --flat-profile a.out gmon.out
See the section on Profiling below for
more information about
using gprof.
Note: The Intel compiler may inline simple codes to such an extent that gprof issues the following warning:
gprof: gmon.out file is missing call-graph data
In this case, use the compiler flag -inline-level=0, which lessens optimization but prevents the compiler from flattening your code.
For even easier timing and profiling without re-compiling,
consider using psrun
from Perfsuite.
1.3 gettimeofday
When instrumenting a code with timing calls, use the routine
gettimeofday if portability is a primary concern.
This routine provides wall clock time.
As usual, see the man pages for particulars on usage. Both its
resolution and its overhead are about one microsecond. It can be used
as an elapsed timer, as shown in the following C code fragment.
#include <stddef.h> /* definition of NULL */
#include <sys/time.h> /* definition of timeval struct and
prototype of gettimeofday */
double t1,t2,elapsed;
struct timeval tp;
int rtn;
....
....
rtn=gettimeofday(&tp, NULL);
t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
....
/* do some work */
....
rtn=gettimeofday(&tp, NULL);
t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
elapsed=t2-t1;
You can also make a C function that can be called from Fortran:
#include <stddef.h> /* defines NULL */
#include <sys/time.h>
double second_() /* compilers like AIX xlf do not require
the trailing '_' */
{
struct timeval tp;
int rtn;
rtn=gettimeofday(&tp, NULL);
return ((double)tp.tv_sec+(1.e-6)*tp.tv_usec);
}
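The table above also lists clock_gettime, which reports nanoseconds rather than microseconds. A minimal sketch of an equivalent function follows. The name second_ns is hypothetical, and the choice of CLOCK_MONOTONIC (a clock intended for measuring intervals) is an assumption of this example rather than something prescribed here. On older glibc versions the program may need to be linked with -lrt.

```c
#define _POSIX_C_SOURCE 200809L /* ask the headers to declare clock_gettime */
#include <time.h>               /* clock_gettime, struct timespec */

/* Hypothetical counterpart of second_() built on clock_gettime.
   The returned value is in seconds; only differences between two
   calls are meaningful, because CLOCK_MONOTONIC has an arbitrary origin. */
double second_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.e-9 * ts.tv_nsec;
}
```

As with the gettimeofday version, the elapsed time is the difference between two calls placed around the work to be timed.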
1.4 SYSTEM_CLOCK
To use the Fortran SYSTEM_CLOCK() subroutine with the Intel extension of 8 byte variables,
the following example can be used.
double precision function second()
integer(8) C,R,M
CALL SYSTEM_CLOCK (COUNT=C, COUNT_RATE=R, COUNT_MAX=M)
second = dble(C)/dble(R)
return
end
The difference between successive calls to second() gives the elapsed time in seconds. For more information
please see the Intel reference page for SYSTEM_CLOCK.
1.5 Linux Assembly code
When instrumenting a code with timing calls, use the native Linux ASM
timer if high resolution, not portability, is important. This routine
provides wall clock time. You can use either the Intel or GNU compiler.
unsigned long long int cycles_x86_64(void)
{
    unsigned long long int val;
    do {
        unsigned int a,d;
        /* rdtsc returns the low 32 bits of the counter in a and the high 32 bits in d */
        asm volatile("rdtsc" : "=a" (a), "=d" (d));
        val = ((unsigned long long)a) | (((unsigned long long)d)<<32);
    } while(0);
    return(val);
}
You can link to the resulting object file with
either Intel or GNU compiler from C or Fortran with the appropriate
wrapper
if needed. Below is an example of using the above routine as a timer.
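#define CPS 2327505000 /* counter ticks per second; this value corresponds
to a clock rate of about 2.33 GHz, change it for other processors */
static double iCPS;
static unsigned start=0;
double second(void) /* Include an '_' if you will be calling from
Fortran */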
{
double foo;
if (!start)
{
iCPS=1.0/(double)CPS;
start=1;
}
foo=iCPS*cycles_x86_64();
return(foo);
}
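A hard-coded ticks-per-second constant ties the timer to one processor. One alternative is to estimate the rate at run time by counting ticks over an interval measured with gettimeofday. The sketch below illustrates the idea under stated assumptions: read_ticks and estimate_ticks_per_second are hypothetical names, and on non-x86 systems the sketch falls back to nanoseconds from clock_gettime purely so that it still compiles. On some processors the counter rate can change with frequency scaling or differ between cores, so the estimate should be treated as approximate.

```c
#define _POSIX_C_SOURCE 200809L /* ask the headers to declare clock_gettime */
#include <stddef.h>             /* NULL */
#include <sys/time.h>           /* gettimeofday, struct timeval */
#include <time.h>               /* clock_gettime (fallback only) */

/* Read the hardware tick counter. */
static unsigned long long read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int a, d;
    __asm__ __volatile__("rdtsc" : "=a" (a), "=d" (d));
    return (unsigned long long)a | ((unsigned long long)d << 32);
#else
    /* Assumption for this sketch only: without rdtsc, count nanoseconds. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long long)ts.tv_sec * 1000000000ULL
         + (unsigned long long)ts.tv_nsec;
#endif
}

static double wall_seconds(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + 1.e-6 * tp.tv_usec;
}

/* Hypothetical helper: spin for about interval_sec seconds of wall time
   and return the number of counter ticks per second seen in that span. */
double estimate_ticks_per_second(double interval_sec)
{
    double t1, t2;
    unsigned long long c1, c2;

    t1 = wall_seconds();
    c1 = read_ticks();
    do {
        t2 = wall_seconds();
    } while (t2 - t1 < interval_sec);
    c2 = read_ticks();
    return (double)(c2 - c1) / (t2 - t1);
}
```

The returned value could replace CPS when initializing iCPS in second(). A longer interval gives a steadier estimate at the cost of a slower start-up.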
2. Profiling
2.1 gprof
A quick way to get more detailed information on functions and routines
is to use the profile tool gprof.
The first step is to compile the source code with the compiler flags
for profiling. For the Intel compiler the flags are -p -g and
for the GNU compiler the flag is
-pg. For the Intel compiler the '-g' flag does not change
the optimization indicated by the presence (if any) of the '-O' flag.
After compiling the code, the second step is to execute the code, which
generates a gmon.out file. To analyze the gmon.out file, use gprof.
The results of the analysis are written to stdout.
% ifort -O -p -g foo.f # or gcc -O -pg foo.c
% ./a.out
% gprof a.out gmon.out
The 'flat' profile contains a useful breakdown of time
spent in functions and subroutines. The 'call graph' profile
contains inclusive and exclusive time spent in subroutines and
functions. See the man pages on the Intel and GNU compilers for
information about the compiler flags for profiling and see the man
page on gprof for its options.
GMON_OUT_PREFIX is an undocumented GMON environment variable.
When it is set while profiling a threaded or MPI code, each process
generates a gmon file called $GMON_OUT_PREFIX.pid.
Each gmon file can then be analyzed separately, or gprof can sum them
into a single gmon.sum file that is examined as a whole (foo is the
name of the executable):
% gprof -s foo $GMON_OUT_PREFIX.*
% gprof foo gmon.sum
Note: The Intel compiler may inline simple codes to such an extent that gprof issues the following warning:
gprof: gmon.out file is missing call-graph data
In this case, use the compiler flag -inline-level=0, which lessens optimization but prevents the compiler from flattening your code.
2.2 Perfsuite
The Perfsuite performance suite provides a profiling tool called psrun.
It is available on NCSA Linux clusters in /usr/apps/tools/perfsuite
and extends the functionality of the timing and profiling tools
mentioned above.
The simplest way to use psrun is with an existing executable:
% setenv PS_HWPC_TIME 0 # collect xml stats file for any successful exit
% psrun ./foo
% psprocess foo*.xml
See the documentation for psprocess for
information on analyzing the XML files generated by psrun.
Performance Engineering and Computational Methods Group
(PECM)
High End Computing Division