-On
The compilers let you select a general level of optimization with the -O flag
followed by a number. The higher the number, the more optimization the compiler
performs, the longer the compile takes, and the more memory the compile uses.
The lowest numeric optimization level is -O2; the compilers currently support
neither -O0 nor -O1.
-O2 (-O)
The -O2 option provides an intermediate level of optimization that does not
require excessive compile time and produces numeric results identical to those
of an unoptimized compile. It avoids the types of optimization that could
produce different numeric results. See the discussion of the -qstrict argument
below for how exact equality of numeric results is achieved.
The -O option is identical to the -O2 option.
The optimizations done at the -O2 level include:
-
Value numbering - folding several instructions into a single instruction.
-
Branch straightening - rearranging program code to
minimize branch logic and combining physically separate blocks of code.
-
Common expression elimination - eliminating duplicate expressions.
-
Code motion - performing calculations outside a loop (if the variables
used in the calculation are not altered within the loop) and using those
results within the loop.
-
Reassociation and strength reduction - rearranging calculation sequences
in loops in order to replace less efficient instructions with more efficient
ones.
-
Global constant propagation - combining constants used in an expression and
generating new ones.
-
Store motion - moving store operations out of loops.
-
Dead store elimination - eliminating stores when the value stored is never
referred to again.
-
Dead code elimination - eliminating code for calculations that are not
required and portions of the code that can never be reached.
-
Global register allocation - keeping variables and expressions in registers
instead of memory.
-
Instruction scheduling - reordering instructions to minimize program
execution time.
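As a rough illustration of two items from this list, the sketch below applies
code motion and strength reduction by hand to a small integer loop. The
function names and the loop itself are hypothetical and are not taken from the
compiler documentation; the second function only approximates what an
optimizer might produce. Integer arithmetic is used so that both forms give
identical results, in keeping with the -O2 guarantee.

```c
#include <stddef.h>

/* Hypothetical example, as a programmer might write it: scale * offset is
 * loop invariant, and 4 * i is recomputed with a multiply on every pass. */
void fill_naive(long *out, size_t n, long scale, long offset)
{
    for (size_t i = 0; i < n; i++)
        out[i] = scale * offset + 4 * (long)i;
}

/* Hand-transformed version, approximating what code motion and strength
 * reduction might do. This is an illustration, not actual compiler output. */
void fill_transformed(long *out, size_t n, long scale, long offset)
{
    long base = scale * offset; /* code motion: computed once, before the loop */
    long step = 0;              /* strength reduction: 4 * i becomes a running add */

    for (size_t i = 0; i < n; i++) {
        out[i] = base + step;
        step += 4;
    }
}
```

Calling both functions with the same arguments fills the two output arrays
with the same values; only the amount of work done inside the loop differs.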
This is a dot product example of the store motion optimization done
at the -O2 level.
Fortran
x=0.0
do i=1,ilim
x=a(i)*b(i)+x
enddo
C
x = 0.0;
for ( i=0 ; i < ilim ; i++ )
{
x+= a[i] * b[i] ;
}
The unoptimized, default compile follows all the source code instructions literally.
In this case, for each iteration of the loop, there would be a new load and a new
store of the variable x. With -O2 optimization, the compiler would recognize
that there is no need to store the value of x until the loop is completed, and intermediate
values would be kept in registers. Even if the loads and stores are cached, the
optimization could lead to an order of magnitude or better improvement in the
performance of this loop.
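The effect of store motion on the dot product can be sketched by hand in C.
The function names are hypothetical, and x is passed through a pointer only to
make the per-iteration memory traffic visible in the source; this is an
illustration of the idea, not actual compiler output.

```c
#include <stddef.h>

/* Literal form: the running sum lives in memory, so each iteration
 * loads *x, adds the product, and stores *x again. */
void dot_literal(const double *a, const double *b, size_t ilim, double *x)
{
    *x = 0.0;
    for (size_t i = 0; i < ilim; i++)
        *x += a[i] * b[i];
}

/* Store motion applied by hand: the sum is accumulated in a local variable
 * (a natural register candidate) and stored once, after the loop. */
void dot_store_moved(const double *a, const double *b, size_t ilim, double *x)
{
    double sum = 0.0;

    for (size_t i = 0; i < ilim; i++)
        sum += a[i] * b[i];
    *x = sum; /* the only store to *x */
}
```

Both functions perform the additions in the same order, so for the same
inputs they leave the same value in x.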
-O3
The -O3 level of optimization performs all of the optimizations done
at the -O2 level, as well as several others that require
more memory or compile time.
Some of these optimizations may change
the semantics of the program slightly and can cause
numeric differences between its results and those of the same
program compiled at the -O2 optimization level or with no optimization.
To disable the optimizations that might produce different results,
include the -qstrict option on the compile line
after -O3 is specified.
These are the types of optimizations done at this level that are not done
at the -O2 level:
-
Rewriting floating-point expressions:
Computations such as a*b*c may be rewritten
as a*c*b if, for example, an opportunity exists to get
a common subexpression by the rearrangement. This
is not done at the -O2 level, since it may give different numeric results.
In addition, divides are replaced at this level by multiplies by the
reciprocal. Divides on the POWER3, as on most processors, are very
expensive operations: they require 14 cycles for 32-bit floating-point
operands and 18 cycles for 64-bit floating-point operands. The floating-point
reciprocal estimate function and a multiply are much cheaper operations, at
the cost of potentially different numeric results.
-
Aggressive code motion and scheduling:
At this optimization level the compiler rearranges the code and instruction
sequence much more aggressively. In particular,
computations that can raise an exception
and whose execution is conditional in the program
may be scheduled unconditionally if doing so might improve
performance.
In other words,
loads and floating-point computations may be placed onto execution paths where
they will be executed even though, according to the actual semantics
of the program, they might not have been.
Loop-invariant floating-point computations that are found on some, but
not all, paths through a loop will not be moved at -O2 because the computations
may cause an exception. At -O3, the compiler will move the computations if the
move is not certain to cause an exception.
The same principle applies to moving
many kinds of loads. Although a load
by means of a pointer will never be moved, at the -O3 optimization level
the compiler will move
other types of loads for a potential performance improvement.
Loads in general are not movable at the -O2 level of optimization
because a program can declare a static array and
then load an element of that array far beyond the declared boundary,
which might cause a segmentation violation.
The same principle applies to scheduling instructions,
as this example shows.
Example: In the following example, at the -O2 optimization level
the computation of b+c is not
moved out of the loop for two reasons: it is a floating-point operation
and so could cause an exception,
and it does not occur on every path through the loop, so it might
never be executed. For this reason, at -O2 the loop-invariant b+c computation
may be performed many times, depending on the values of the elements of the
array a.
At -O3, the computation is moved outside the loop and done only once.
Fortran
do i=1,ilim-1
if(a(i).lt.a(i+1)) a(i)=b+c
enddo
C
for (i = 0 ; i < ilim-1 ; i++)
{
if (a[i] < a[i+1])
a[i] = b + c ;
}
-
Incorrect sign for zero:
-O3 also performs some optimizations that -O2 avoids because they
may produce a zero result with the wrong sign. For example,
the expression "x + 0.0" is not replaced with "x" at the -O2 level.
The redundant add of 0.0 to x is kept
because x might be equal to -0.0 and, under IEEE rules, -0.0 + 0.0 = 0.0,
which would be -x in this case. Since, in the overwhelming majority of
cases, this has no significant impact on a program's results, -O3
substitutes "x" for "x + 0.0".
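The transformations described in this list can be imitated by hand to see why
they may change a program's results. The following C sketch is only an
illustration under stated assumptions: the function names and test values are
hypothetical, IEEE double arithmetic in the default rounding mode is assumed,
and an ordinary 1.0 / y stands in for the hardware reciprocal estimate. The
volatile qualifiers force each intermediate result to be rounded to double.

```c
#include <math.h>
#include <stddef.h>

/* Rewriting a*b*c as a*c*b: here a*b overflows while a*c does not. */
int reassociation_differs(void)
{
    volatile double a = 1e308, b = 10.0, c = 0.1;
    volatile double ab = a * b;   /* beyond the double range: +infinity */
    volatile double ac = a * c;   /* roughly 1e307: finite */
    volatile double abc = ab * c; /* stays infinite */
    volatile double acb = ac * b; /* roughly 1e308: still finite */

    return isinf(abc) && isfinite(acb);
}

/* Replacing a divide by a multiply with the reciprocal. Assumption: a
 * plain 1.0 / y is used instead of a reciprocal estimate instruction. */
int reciprocal_differs(void)
{
    volatile double x = 49.0, y = 49.0;
    volatile double quotient = x / y;    /* exactly 1.0 */
    volatile double recip = 1.0 / y;     /* rounded once */
    volatile double product = x * recip; /* rounded again: just below 1.0 */

    return quotient == 1.0 && product < 1.0;
}

/* Replacing x + 0.0 by x loses the sign change when x is -0.0. */
int zero_sign_differs(void)
{
    volatile double x = -0.0;
    volatile double sum = x + 0.0; /* +0.0 under IEEE rules */

    return signbit(x) != 0 && signbit(sum) == 0;
}

/* Hand-hoisted form of the conditional b + c loop (0-based C indexing):
 * the add now executes once, even if no element ever needs it. */
void update_hoisted(double *a, size_t ilim, double b, double c)
{
    double sum = b + c; /* moved out of the loop */

    for (size_t i = 0; i + 1 < ilim; i++) {
        if (a[i] < a[i + 1])
            a[i] = sum;
    }
}
```

Each of the three predicate functions returns nonzero when the rewritten form
gives a different result from the original one for the chosen values. The
values were picked to make the difference visible; for most inputs the two
forms agree, or differ only in the last bit.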
Some limitations of the -O3 level are:
-
It does not do any processor-specific optimizations like those done
by the -qarch or -qtune options.
-
It does not optimize complex loops or those with 3 or more loop indices
very well. See the discussion under -qhot.
-
Integer divide instructions are not optimized.
-O4
The -O4 level of optimization performs all of the optimizations done
at the -O3
level as well as several other optimizations.
This argument is equivalent to:
This flag should be specified both at compile and link time.
-O5
The -O5 level of optimization performs all of the optimizations done
at the -O4
level as well as the optimization specified by
-qipa=level=2, which is described below.
This argument is equivalent to:
This flag should be specified both at compile and link time.