IBM Compiler Optimization Flags
Introduction
IBM Fortran, C, and C++ compiles are done without any optimization by default. Any level of optimization done by the compiler must be explicitly specified by means of flags to the compiler at compile and link time. This is different from the situation on the previous Cray platforms at NERSC, whose compilers provided a fairly high level of optimization by default, and it was necessary to ask explicitly for an unoptimized compile if that was what was wanted.
For the most part, the IBM Fortran, C, and C++ compilers all have the same optimization arguments. The description of these arguments below applies to all three compilers unless otherwise stated.
These are the currently recommended optimization options for compiling on the machine on which you will be running the code:
-O3 -qstrict -qarch=auto -qtune=auto
These options provide a compromise between minimizing compilation time and maximizing the performance of the compiled code. All of these options as well as several other useful optimization options will be described below.
Some specific examples of the compile time and run time impact of several different sets of optimization arguments on several public benchmarks are given at Compiler Optimization Argument Examples.
-On
The compilers allow you to specify a general level of optimization by giving a numeric optimization level with the -O flag. The higher the number, the greater the amount of optimization the compiler does, the longer the compile takes, and the more memory the compile uses. The lowest numeric optimization level is -O2; the compilers do not currently support -O0 or -O1 optimization arguments.
-O2 (-O)
The -O2 option is designed to provide an intermediate level of optimization that does not require an excessive amount of time to perform the compile and will produce numeric results identical to those produced by an unoptimized compile. It avoids certain types of optimizations that have the potential to produce different numeric results. See the section on the -qstrict argument below for a discussion on how the exact equality of numeric results is accomplished.
The -O option is identical to the -O2 option.
The optimizations done at the -O2 level include:
- Value numbering - folding several instructions into a single instruction.
- Branch straightening - rearranging program code to minimize branch logic and combining physically separate blocks of code.
- Common expression elimination - eliminating duplicate expressions.
- Code motion - performing calculations outside a loop (if the variables in the loop are not altered within it) and using those results within the loop.
- Reassociation and strength reduction - rearranging calculation sequences in loops in order to replace less efficient instructions with more efficient ones.
- Global constant propagation - combining constants used in an expression and generating new ones.
- Store motion - moving store operations out of loops.
- Dead store elimination - eliminating stores when the value stored is never referred to again.
- Dead code elimination - eliminating code for calculations that are not required and portions of the code that can never be reached.
- Global register allocation - keeping variables and expressions in registers instead of memory.
- Instruction scheduling - reordering instructions to minimize program execution time.
This is a dot product example of the store motion optimization done at the -O2 level.
Fortran
x=0.0
do i=1,ilim
   x=a(i)*b(i)+x
enddo
C
x = 0.0;
for ( i=0 ; i < ilim ; i++ )
{
x+= a[i] * b[i] ;
}
The unoptimized, default compile follows all the source code instructions literally. In this case, for each iteration of the loop, there would be a new load and a new store of the variable x. With -O2 optimization, the compiler would recognize that there is no need to store the value of x until the loop is completed, and intermediate values would be kept in registers. Even if the loads and stores are cached, the optimization could lead to an order of magnitude or better improvement in the performance of this loop.
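The effect of store motion on the dot product can be imitated by hand. The following is a minimal sketch, not compiler output: the function names are hypothetical, and `volatile` is used only to force the load and store of the running sum on every iteration, the way a literal, unoptimized translation would behave.

```c
#include <assert.h>
#include <stddef.h>

/* Literal translation: the running sum lives in memory (*x), so every
   iteration loads it, adds to it, and stores it back. */
void dot_store_each_iteration(const double *a, const double *b,
                              size_t n, volatile double *x)
{
    assert(a != NULL && b != NULL && x != NULL);
    *x = 0.0;
    for (size_t i = 0; i < n; i++) {
        *x += a[i] * b[i];   /* load *x, add, store *x */
    }
}

/* Hand-applied store motion: accumulate in a local variable, which the
   compiler can keep in a register, and store to memory once. */
void dot_store_after_loop(const double *a, const double *b,
                          size_t n, double *x)
{
    assert(a != NULL && b != NULL && x != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    *x = sum;                /* single store after the loop */
}
```

Both versions add the products in the same order, so they return identical results; only the memory traffic differs.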
-O3
The -O3 level of optimization performs all of the optimizations done at the -O2 level as well as several other optimizations that require more memory or time to accomplish.
Some optimizations may be done that will change the semantics of the program slightly, and might cause numeric differences between the results of the program and the same program compiled at the -O2 optimization level or with no optimization. To disable those optimizations that might produce different results, include the -qstrict option on the compile line after -O3 is specified.
These are the types of optimizations done at this level that are not done at the -O2 level:
- Rewriting floating-point expressions:
Computations such as a*b*c may be rewritten as a*c*b if, for example, an opportunity exists to get a common subexpression by the rearrangement. This is not done at the -O2 level, since it may give different numeric results.
In addition, divides are replaced by multiplies by the reciprocal at this level. Divides on the POWER3, as on most processors, are very expensive operations. They require 14 cycles for 32 bit floating point and 18 cycles for 64 bit floating point operands. The floating-point reciprocal estimate function and multiply are much cheaper operations, at the cost of potentially different numeric results.
- Aggressive code motion and scheduling:
At this optimization level the compiler will rearrange the code and instruction sequence much more aggressively. In particular, computations that could raise an exception, and whose execution is conditional in the program, might be scheduled unconditionally at this level if doing so could improve performance. In other words, loads and floating-point computations may be placed onto execution paths where they will be executed even though, according to the actual semantics of the program, they might not have been.
Loop-invariant floating-point computations that are found on some, but not all, paths through a loop will not be moved at -O2 because the computations may cause an exception. At -O3, the compiler will move the computations if the move is not certain to cause an exception.
The same principle is followed when it comes to moving many kinds of loads. Although a load by means of a pointer will never be moved, at the -O3 optimization level the compiler will move other types of loads for a potential performance improvement. Loads in general are not movable at the -O2 level of optimization because a program can declare a static array and then load an element of that array far beyond the declared boundary, which might cause a segmentation violation.
The same principle is followed when it comes to scheduling instructions, as this example shows.
Example: In the following example, at the -O2 optimization level the computation of b+c is not moved out of the loop for two reasons: it is considered dangerous because it is a floating-point operation and could thus possibly cause an exception and it does not occur on every path through the loop, so it potentially may never be executed. For this reason, at -O2 the loop invariant b+c computation may be performed many times based on the values of the elements of the array a. At -O3, the computation is moved outside the loop and done only once.
Fortran
do i=1,ilim-1
   if(a(i).lt.a(i+1)) a(i)=b+c
enddo
C
for (i = 0 ; i < ilim-2 ; i++)
{
if (a[i] < a[i+1]) a[i] = b + c ;
}
- Incorrect sign for zero:
-O3 also will do some optimizations not performed at -O2 because they may produce an incorrect sign in cases with a zero result. For example, the expression "x + 0.0" would not be replaced with "x" at the -O2 level. A redundant add of 0.0 to x will be done because x might be equal to -0.0 and, under IEEE rules, -0.0 + 0.0 = 0.0 which would be -x in this case. Since, in the overwhelming majority of cases, this has no significant impact on a program's results, -O3 will substitute "x" for "x + 0.0".
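Two of the numeric effects described above can be observed directly in IEEE double precision. The sketch below is illustrative only: the function names are hypothetical, and an exactly rounded `1.0 / y` stands in for the hardware reciprocal path, which is an assumption rather than a model of what the POWER3 actually computes. `volatile` keeps the intermediate values from being simplified away.

```c
#include <assert.h>
#include <math.h>

/* Division as written in the source. */
double divide(double x, double y)
{
    assert(y != 0.0);
    return x / y;
}

/* The divide rewritten as a multiply by the reciprocal: x * (1/y).
   The reciprocal is rounded once and the product is rounded again. */
double multiply_by_reciprocal(double x, double y)
{
    assert(y != 0.0);
    volatile double r = 1.0 / y;
    return x * r;
}

/* Returns 1 if x + 0.0 has a different sign bit from x, else 0. */
int add_zero_changes_sign(double x)
{
    volatile double sum = x + 0.0;
    return (signbit(sum) != 0) != (signbit(x) != 0);
}
```

For example, `divide(49.0, 49.0)` is exactly 1.0, while `multiply_by_reciprocal(49.0, 49.0)` is slightly less than 1.0 because 1/49 is rounded before the multiply. Likewise `add_zero_changes_sign(-0.0)` returns 1, which is the case the -O2 level preserves by keeping the redundant add.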
Some limitations of the -O3 level are:
- It does not do any processor specific optimizations like those done by the -qarch or -qtune options.
- It does not optimize complex loops or those with 3 or more loop indices very well. See the discussion under -qhot.
- Integer divide instructions are not optimized.
-O4
The -O4 level of optimization performs all of the optimizations done at the -O3 level as well as several other optimizations. This argument is equivalent to:
-O3 -qhot -qipa -qarch=auto -qtune=auto -qcache=auto
This flag should be specified both at compile and link time.
-O5
The -O5 level of optimization performs all of the optimizations done at the -O4 level as well as the optimization specified by -qipa=level=2, which is described below. This argument is equivalent to:
-O4 -qipa=level=2
This flag should be specified both at compile and link time.
-qstrict
The -qstrict option ensures that any optimization done will not alter the semantics of a program, and that the numeric results of a program will be identical with those produced by an unoptimized program. This option actually limits the amount of optimization done when it is included with any other optimization argument.
Specifically, optimizations that perform operations like these are not done:
- Loads and floating-point computations that may trigger an exception.
- Relaxing conformance to IEEE rules.
- Reassociating floating-point expressions.
With this option, a strict computational order is observed based on the language rules for operator precedence and left to right operation regardless of the potential negative effect on performance.
The following example of the potentially inhibiting effects of this argument on optimization is taken from the IBM document "Power3 Introduction and Tuning Guide", SG24-5155-00, which is available at the IBM Documentation website.
When evaluating an expression like this:
A*B*C + B*C*D
an optimized compile would recognize that B*C is a "common sub-expression" and evaluate it only once. However, if -qstrict is specified, this optimization would not be done, since that would violate the left to right ordering rule on A*B*C. Since floating point arithmetic is not associative there is no guarantee the results of (A*B)*C would be bitwise identical to those of A*(B*C).
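The non-associativity is easiest to see with values chosen so that one grouping overflows. The following sketch uses hypothetical function names and deliberately extreme inputs; it writes out both evaluation orders by hand rather than relying on any compiler option.

```c
/* Strict left-to-right evaluation: (A*B)*C + (B*C)*D. */
double strict_order(double a, double b, double c, double d)
{
    volatile double ab = a * b;   /* may overflow before C is applied */
    volatile double bc = b * c;
    return ab * c + bc * d;
}

/* Common subexpression B*C computed once: A*(B*C) + (B*C)*D. */
double shared_subexpression(double a, double b, double c, double d)
{
    volatile double bc = b * c;
    return a * bc + bc * d;
}

/* Returns 1 when the two evaluation orders give different results. */
int orders_disagree(double a, double b, double c, double d)
{
    return strict_order(a, b, c, d) != shared_subexpression(a, b, c, d);
}
```

With A = 1e200, B = 1e200, C = 1e-200, and D = 1.0, the strict order overflows to infinity in A*B, while the shared-subexpression order stays finite because B*C is close to 1. For ordinary values such as 2, 3, 4, 5 the two orders agree exactly.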
In practice, we have observed that compiling with this argument rarely has a negative impact on the performance of a code. It can even be the case that a code compiled with -qstrict is actually faster at run time than it was when compiled without this argument.
-qhot
The -qhot option is by far the most expensive option in terms of adding to the elapsed time of a compile. Adding this option often more than triples the compile time of a code.
The -qhot option performs several loop oriented optimizations:
- restructures loops with 3 or more loop indices to traverse the loop more efficiently.
- pads array dimensions and data objects to minimize cache misses. (Arrays with dimensions that are a power of two are particularly susceptible to cache misses).
- converts certain operations that are done on successive elements of an array (e.g. square root) into a "vector" operation that calculates several results at once at a much faster rate than calculating them sequentially.
- performs these optimization transformations on loops: scalar replacement, loop blocking, distribution, fusion, interchange, reversal, skewing, and unrolling.
- reduces the generation of temporary arrays.
The Fortran option -qreport=hotlist will produce a listing describing all of the transformations done by -qhot. The listing is in a file called program.lst where program.f is the name of the source code file that was compiled.
It is possible for the -qhot option to decrease the performance of a program if the compiler does not have enough information about loop bounds and array dimensions, so that it attempts inappropriate optimizations.
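Loop interchange, one of the transformations listed above, can be illustrated by hand. This is a sketch of the idea only, with hypothetical function names; it does not show what the compiler emits for any particular loop. The matrix is stored row by row in a flat array, as C stores two-dimensional arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Column-first traversal of a row-major array: consecutive accesses
   are cols elements apart in memory, which is hard on the cache. */
double sum_column_first(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += m[i * cols + j];
    return sum;
}

/* After interchanging the two loops: consecutive accesses are adjacent
   in memory (unit stride). */
double sum_row_first(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += m[i * cols + j];
    return sum;
}
```

Note that the interchange changes the order in which the elements are added, so for general floating-point data the two sums can differ in the last bits even though they are mathematically equal.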
-qipa
The -qipa option enables interprocedural analysis (IPA) by the compiler. This enables the compiler to identify optimization opportunities across procedural boundaries. It does this by extending the area that is examined during optimization and inlining from a single procedure to multiple procedures (possibly in different source files) and the linkage between them. This option should be included in both the compile and link phases.
There is a rich collection of suboptions of the form -qipa=suboption. See the compiler man pages for more details about these suboptions. Some useful options are the -qipa=level=n options where n can be 0, 1, or 2. These determine the amount of interprocedural analysis and optimization that is performed. As with the -On options, the higher the number, the greater the amount of optimization performed.
-qipa=level=0
This does only minimal interprocedural analysis and optimization.
-qipa=level=1 (-qipa)
This is the default level for ipa. The two options -qipa=level=1 and -qipa are identical.
This level turns on inlining, limited alias analysis, and limited call-site tailoring.
Inlining a procedure causes a procedure call to be replaced by the procedure itself to eliminate the overhead of the call.
An alias occurs when different variables in a program refer to the same area of storage. If the compiler is unsure whether a given global variable is aliased, it will assume that every procedure call might cause the variable to be read or changed. For this reason it will generate extra loads and stores to preserve the value of the variable when the procedure is called instead of keeping it in a register.
Call-site tailoring is IBM's generic term for optimizations that are performed on a function-call basis like cloning and inlining.
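The reason a procedure call forces extra loads of a global variable can be seen in a small sketch. All names here are hypothetical, and the function pointer stands in for a call into another source file whose body the compiler cannot see without interprocedural analysis.

```c
/* Hypothetical global variable, for illustration only. */
int counter = 0;

/* A routine that modifies the global. */
void bump(void) { counter += 1; }

/* A routine that leaves the global alone. */
void no_touch(void) { }

/* Reads the global before and after an opaque call. Without knowing
   what the callee does, the compiler must reload counter after the
   call instead of reusing the value it already holds in a register. */
int read_around_call(void (*callee)(void))
{
    int before = counter;   /* first load */
    callee();               /* may or may not write counter */
    int after = counter;    /* reload required unless the callee is
                               known not to touch counter */
    return before + after;
}
```

If analysis across procedures can establish that a callee such as `no_touch` never writes `counter`, the second load becomes redundant; with a callee such as `bump` it is required for correctness.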
-qipa=level=2
This ipa level performs full interprocedural data flow and alias analysis.
-qarch
The -qarch argument specifies the type of processor on which the executable code will be run, and produces an executable program that contains machine instructions specific to that processor. This allows the compiler to take advantage of processor-specific instructions that can improve performance at the cost of producing an executable program that will run on only one type of processor. The default for this argument is -qarch=com, which will produce a program that is runnable on any POWER or POWERPC processor.
The recommended value for this argument is
-qarch=auto
The -qarch=auto option tells the compiler to produce a program with machine instructions specific to the processor on which it is compiled.
-qtune
The -qtune argument specifies the type of processor for which the program should be tuned to produce the best performance. Tuning for a processor involves instruction selection, scheduling, taking advantage of cache sizes and setting up pipelining to take advantage of the specified processor's hardware. Unlike the -qarch option, this option does not produce processor specific code. A code that is compiled with a -qtune argument designated for one type of processor will run correctly on any other POWER or POWERPC processor, although its performance may be worse than it would be if it were compiled with the appropriate -qtune argument.
The recommended value for this argument is
-qtune=auto
The -qtune=auto option tells the compiler to produce a program tuned for the processor on which it is compiled.
-qcache
The -qcache argument specifies the cache configuration of the processor on which the program will be run. The argument must be used in combination with the -O4, -O5, or -qipa options with a C or C++ compile and in combination with the -qhot option with a Fortran compile. The compiler uses the information provided by this argument to determine how loop operations can be structured or blocked to process only the amount of data that can fit into the data cache. The default value for this argument is the same as that of the -qtune option.
As with -qtune, a code compiled with this argument designated for one type of processor will run correctly on any other POWER or POWERPC processor, although its performance may be worse than it would be if it were compiled with the appropriate -qcache argument.
This option is designed for those cases in which the cache on the processor on which the code will be run is different from the standard cache for that processor.
If this argument is used in a compile, NERSC recommends it be given this value:
-qcache=auto
The -qcache=auto option tells the compiler to produce a program for the cache configuration of the processor on which it is compiled.
-lessl
The Engineering and Scientific Subroutine Library (ESSL) is a collection of high performance numerical routines. These routines are very highly optimized for the POWER3 architecture, and using them will almost always produce the fastest possible code for any given numerical algorithm.
If the -lessl argument is specified at link time, single threaded versions of these routines will be used from the ESSL serial library.
-lesslsmp
Some of these subroutines are also available in multithreaded versions that will run in parallel over a node using the shared memory parallel processing programming model. If the -lesslsmp argument is used at link time, these versions will be loaded if they are available, and by default these routines will be run with sixteen user threads. The number of user threads that the routine will use can be controlled by the OMP_NUM_THREADS environment variable in the manner described at Changing the number of threads and tasks. The routine will run with sixteen user threads if this variable is not set.
-qessl
The -qessl option for Fortran compiles directs the compiler to substitute the much faster routines from the Engineering and Scientific Subroutine Library (ESSL) for their equivalent Fortran 90 intrinsic procedures when it is safe to do so. Both 32 and 64 bit datatypes are supported. In addition to -qessl, at least one of these other options must be specified for this optimization to take effect: -qsmp, -qipa , -qhot, -O3, -O4, or -O5.
In addition to -qessl, either -lessl or -lesslsmp must be specified at link time. If -lessl is specified, the single threaded versions from the ESSL library will be used. If -lesslsmp is specified, the multi-threaded versions will be used. Codes linked with -lesslsmp will be run with OMP_NUM_THREADS user threads or with one user thread per processor if this environment variable has not been set.
-qsmp
The -qsmp option, which is equivalent to -qsmp=auto, tells the compiler to attempt to parallelize the user code. It will do this by attempting to parallelize explicitly coded loops, and, with Fortran, loops that are generated by the compiler for array language. This option must be used with a "thread safe" version of the compiler, one with a "_r" suffix, e.g. xlf90_r, mpCC_r, xlc_r, etc.
Codes compiled with this option will be run with OMP_NUM_THREADS user threads or with one user thread per processor if this environment variable has not been set.
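As a general illustration of which explicitly coded loops are candidates for this kind of parallelization, the sketch below contrasts a loop with independent iterations and a loop with a dependence carried from one iteration to the next. The function names are hypothetical, and whether the compiler actually parallelizes a given loop depends on its own analysis; this is not a statement about specific -qsmp behavior.

```c
#include <stddef.h>

/* Independent iterations: y[i] depends only on x[i], so the iterations
   could be divided among threads in any order. The restrict qualifiers
   state that x and y do not overlap. */
void scale(double *restrict y, const double *restrict x, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = s * x[i];
}

/* Loop-carried dependence: y[i] needs y[i-1] from the previous
   iteration, so the iterations cannot simply run concurrently
   as written. */
void running_sum(double *restrict y, const double *restrict x, size_t n)
{
    if (n == 0)
        return;
    y[0] = x[0];
    for (size_t i = 1; i < n; i++)
        y[i] = y[i - 1] + x[i];
}
```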
IBM Compiler Optimization Argument Examples
This describes the compilation and run time impact on several different publicly available benchmarks of a variety of compiler optimization arguments.
Introduction
Publicly available benchmarks are compiled and run with several different sets of optimization options and the performance recorded. The time required to compile and link the code is also recorded, and the results summarized.
The following information is given for each benchmark:
- The source of the benchmark, the source code changes required to enable it to compile and run on the SP, and the way the compiler is invoked to produce the executable, or, if a makefile is used, the changes to the makefile required for the code to run on the SP.
- The elapsed time for a single threaded compile and link from source code for the given set of optimization arguments.
- A performance measurement: an internal figure such as MFLOPS if the benchmark provides one; otherwise the elapsed time for a (possibly multi-threaded) run of the code on a dedicated system with the given set of options, or the GFlops figure returned by the poe+ program.
The numbers given in the tables below for the individual benchmarks are the best of several dedicated runs in batch mode.
Linpack
The Linpack benchmark solves a dense set of linear equations. The version tested here is the 1000x1000 double precision version obtained from 1000d. It is contained in a single 755 line source code file containing 11 subroutines and functions in addition to the main driver. It is a simple Fortran 77 code originally written in 1978 and last modified in 1992.
Source Code Changes
All references to second() in the source were replaced with references to rtc().
Compile Changes
The code was compiled with the xlf compiler with no options beyond the optimization options.
Times
These runs were done with the 8.1.1.3 version of xlf in December, 2003.
The Compile Time is the wall-clock time for the compile and link returned by the unix time command.
The MFLOPS result is that returned by the internal timer in the code.
Results
| Optimization | Compile Time (s) | MFLOPS |
|---|---|---|
| unoptimized | 0.52 | 49 |
| -O2 | 1.36 | 243 |
| -O3 -qstrict -qarch=pwr3 -qtune=pwr3 | 1.98 | 256 |
| -O3 -qarch=pwr3 -qtune=pwr3 | 2.07 | 257 |
| -O3 -qhot -qarch=pwr3 -qtune=pwr3 | 3.75 | 256 |
| -O4 -qnohot | 4.08 | 264 |
| -O4 | 6.56 | 264 |
| -O5 -qnohot | 4.19 | 266 |
| -O5 | 5.96 | 263 |
Comments
There was no significant difference in the code's performance at any optimization level when the threaded compiler (xlf_r) was used or when the MASS library was included. In this case, compiling with the recommended optimization options, -O3 -qstrict -qarch=pwr3 -qtune=pwr3, gives close to the best performance obtainable with any of the other sets of optimization options.
This example exhibits the limitations of the use of compiler options alone to improve a code's performance, since the best performance obtained on a single processor in this example attains only 18% of the processor's theoretical peak performance of 1.5 GFlops.
Most of the work in this example is done by four BLAS routines, daxpy, ddot, dscal, and idamax, that are also in the IBM high performance ESSL library. However, when the benchmark versions of these routines are replaced with the ESSL routines the performance attained is no better than 250 MFlops, worse than with the benchmark versions.
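The 18% figure quoted above follows from the best MFLOPS value in the results table (266, for -O5 -qnohot) and the 1.5 GFlops peak. A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
#include <assert.h>

/* Fraction of theoretical peak, as a percentage. Both arguments are
   in the same unit (here MFLOPS). */
double percent_of_peak(double measured_mflops, double peak_mflops)
{
    assert(peak_mflops > 0.0);
    return 100.0 * measured_mflops / peak_mflops;
}
```

Calling `percent_of_peak(266.0, 1500.0)` gives about 17.7, which rounds to the 18% in the text; the recommended options at 256 MFLOPS come to about 17.1%.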
Fortran Livermore Loops
This version of the double precision Livermore Loop benchmark was obtained from livermore. This is the 1991 update of the benchmark whose earliest version dates from the 1970's. It contains 24 numeric kernels written in fairly straightforward, uncomplicated Fortran 77. Several summary figures are returned by the program at the end of the run.
Source Code Changes
The only changes to the original source required were to the timing routines. The SECOND function definition in the main routine at line 556 was uncommented:
REAL*8 SECOND
These three lines, 4469-4471, in the SECOND function were commented out and replaced with a call to the system elapsed time measurement function rtc():
C REAL*4 CPUTYM(4), ETIME
C XT= ETIME( CPUTYM)
C SECOND= CPUTYM(1)
second=rtc()
Compile Changes
The code was compiled with the xlf compiler with no additional arguments beyond those for optimization.
Timers
The Compile Time result is the number of seconds required to compile and link the test program. To compare the effects of the various optimization levels, the Average (mean), Minimum, and Maximum MFLOPS rates for the loops returned by the code are listed.
Livermore Loop MFLOPS
| Optimization | Compile Time (s) | Average | Minimum | Maximum |
|---|---|---|---|---|
| unoptimized | 1.63 | 48 | 11 | 141 |
| -O2 | 8.73 | 245 | 22 | 995 |
| -O3 -qstrict -qarch=pwr3 -qtune=pwr3 | 14.22 | 322 | 60 | 1259 |
| -O3 -qarch=pwr3 -qtune=pwr3 | 15.07 | 339 | 60 | 1257 |
| -O3 -qhot -qarch=pwr3 -qtune=pwr3 | 67.62 | 315 | 58 | 996 |
| -O4 -qnohot | 34.08 | 353 | 58 | 1532 |
| -O4 | 80.04 | 321 | 58 | 998 |
| -O5 -qnohot | 49.10 | 360 | 58 | 1541 |
| -O5 | 91.23 | 319 | 58 | 997 |
Comments
The performance of this benchmark is significantly degraded when the -qhot option is specified. Not only is the compile time greatly increased, but both the Average and Maximum MFLOPS are significantly worse than the corresponding optimization level without the -qhot option. This may be due to the fact that all of the loops are fairly small and uncomplicated, and the sophisticated analysis and loop restructuring done by this option add too much overhead at execution time.
Another interesting feature is that two of the higher level optimizations that do not include -qhot, -O4 -qnohot and -O5 -qnohot, are reported as attaining a maximum MFLOPS rate greater than the theoretical peak performance of the POWER3 processor, 1.5 GFLOPS. The loop that attains this speed is Kernel 7, a very short loop representing an equation of state fragment:
1007 DO 7 k= 1,n
X(k)= U(k ) + R*( Z(k ) + R*Y(k )) +
1 T*( U(k+3) + R*( U(k+2) + R*U(k+1)) +
2 T*( U(k+6) + Q*( U(k+5) + Q*U(k+4))))
7 CONTINUE
Very likely, these optimizations make use of the fact that several of the elements in the equation are used in more than one iteration of the loop and need only be computed once. When the POWER3 hardware performance monitor is applied to this loop by means of hpmcount, the measured MFLOPS for this loop are around 750.
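For readers more comfortable with C, Kernel 7 can be transcribed as follows. This is a hand translation for illustration, not part of the benchmark: the function name and argument order are hypothetical, and indices are shifted to start at zero. As written, each iteration performs 8 multiplications and 8 additions, and the elements u[k+1] through u[k+6] loaded in one iteration are needed again by the iterations that follow.

```c
#include <assert.h>
#include <stddef.h>

/* C transcription of Livermore Kernel 7 with 0-based indices.
   u must hold at least n + 6 elements; x, y, and z at least n. */
void kernel7(double *x, const double *u, const double *y, const double *z,
             double r, double t, double q, size_t n)
{
    assert(x != NULL && u != NULL && y != NULL && z != NULL);
    for (size_t k = 0; k < n; k++) {
        x[k] = u[k] + r * (z[k] + r * y[k]) +
               t * (u[k + 3] + r * (u[k + 2] + r * u[k + 1]) +
                    t * (u[k + 6] + q * (u[k + 5] + q * u[k + 4])));
    }
}
```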
NAS Kernels
The NAS Kernel Benchmark consists of seven Fortran test kernels that perform calculations typical of scientific applications run at the NASA Ames Research Center. It was written in the 1980's and consists of approximately 1000 lines of Fortran code, organized into seven separate tests.
Source Code Changes
The only changes made were to the CPTIME internal timer routine. The original version was replaced by this SP specific version:
common /savetime/tx
real*8 rtc,tx,t
T = rtc()
if (tx.gt.t) tx=0
CPTIME = real(T - TX)
TX = T
RETURN
END
Compile Changes
The code was compiled with the xlf compiler with no options beyond the optimization options.
Timers
The Compile Time result is the wall clock seconds for the compile and link returned by the unix time command. In this table the average MFLOPS for all the kernels, as returned by the program's internal timer, is given.
Timings
| Optimization | Compile Time (s) | Average MFLOPS |
|---|---|---|
| unoptimized | 0.72 | 31 |
| -O2 | 3.07 | 75 |
| -O3 -qstrict -qarch=pwr3 -qtune=pwr3 | 8.31 | 79 |
| -O3 -qarch=pwr3 -qtune=pwr3 | 8.49 | 79 |
| -O3 -qhot -qarch=pwr3 -qtune=pwr3 | 42.01 | 91 |
| -O4 -qnohot | 15.96 | 79 |
| -O4 | 45.16 | 90 |
| -O5 -qnohot | 20.32 | 79 |
| -O5 | 44.78 | 90 |
Comments
This provides a contrast with the Livermore kernels in that the -qhot option significantly improves performance when it is added to other optimization options, at the cost of an almost fivefold increase in compile time in some cases.
Individual Kernels
This benchmark also provides timings for the seven individual kernels.
- MXM - 4-way unrolled matrix multiply routine.
- FFT - Complex radix 2 fft.
- CHOL- Cholesky decomposition/substitution.
- BTRIX - Block tri-diagonal solver.
- GMTRY - Compute solid-related arrays, Gauss eliminate the matrix of wall influence coefficients.
- EMIT - Emit new vortices to satisfy boundary condition.
- VPENTA - Invert 3 pentadiagonals simultaneously.
Individual Kernel MFLOPS
| Optimization | MXM | FFT | CHOL | BTRIX | GMTRY | EMIT | VPENTA |
|---|---|---|---|---|---|---|---|
| unoptimized | 53 | 34 | 18 | 57 | 13 | 97 | 26 |
| -O2 | 236 | 162 | 136 | 210 | 18 | 150 | 37 |
| -O3 -qstrict -qarch=pwr3 -qtune=pwr3 | 753 | 157 | 192 | 218 | 18 | 153 | 36 |
| -O3 -qarch=pwr3 -qtune=pwr3 | 765 | 146 | 206 | 221 | 17 | 153 | 36 |
| -O3 -qhot -qarch=pwr3 -qtune=pwr3 | 424 | 164 | 210 | 287 | 18 | 472 | 57 |
| -O4 -qnohot | 763 | 149 | 207 | 220 | 18 | 156 | 37 |
| -O4 | 503 | 151 | 212 | 285 | 18 | 473 | 57 |
| -O5 -qnohot | 749 | 153 | 210 | 220 | 18 | 157 | 37 |
| -O5 | 431 | 162 | 211 | 303 | 18 | 473 | 57 |
References
These are useful references for IBM compiler optimization:
- Compiling for Performance on Bassi (PowerPoint Presentation)
- IBM Compilers for pSeries. This is a 2001 presentation at Scicomp describing many important compiler options and summarizing the operation of the different components of the compilers on codes.
- The T. J. Watson Research Center is the source of much of the information and many of the examples in this document.
- IBM SP Documentation: IBM XL Fortran User's Guide, IBM XL Fortran Language Reference, C/C++ Compiler and Language References.
- Power3 Introduction and Tuning Guide, SG24-5155-00. This is a highly detailed description of the POWER3 performance characteristics.
- IBM has a very useful manual entitled, "Optimization and Tuning Guide for Fortran, C, and C++", SC09-1705-01. Unfortunately there are two drawbacks to it: the most recent edition is dated June, 1996, so it is somewhat out of date, and it is not available on-line. A paper copy may be ordered from LINKWeb Order Support.
- The ascii file /usr/lpp/xlf/DOC/README.xlf contains the release notes for the current version of the Fortran compiler.
- XL Fortran: Eight Ways to Boost Performance
Page last modified: Thu, 27 May 2004 04:56:56 GMT
Page URL: http://www.nersc.gov/nusers/resources/software/ibm/opt_options/print.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov