Automatic vectorization, in the context of a computer program, refers to the transformation of a series of operations performed sequentially, one step at a time, into operations performed in parallel, several at once, in a form suitable for processing by a vector processor. A vector processor, or array processor, is a CPU design that can apply a mathematical operation to a large number of data elements at once. ...
An example would be a program that sums the columns of a large table of numeric data. A sequential approach would be something like:
for j = 1 to the number of columns in table T {
    V[j] = 0.0;
    for i = 1 to the number of rows in table T {
        V[j] = V[j] + T[i, j];
    }
}
This could be transformed into vectorized code something like:
for j = 1 to the number of columns in table T {
    Send column j of table T to an available vector processor to be summed,
    placing the result in V[j]; continue without waiting for the result;
}
Wait for the completion of all the above vector operations;
The vectorized approach permits the sums of several columns to be computed at the same time. It will be faster than the sequential approach if the time required to communicate with the vector processors is small compared with the time needed to compute the sums sequentially.
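On a CPU with short vector (SIMD) registers, the same idea is often expressed by arranging the loops so that several column sums advance together. The following C function is a minimal sketch of that arrangement, not a definitive implementation. It assumes a row-major table stored in one contiguous array; the name column_sums and its parameters are hypothetical and chosen only for illustration. Whether a given compiler actually vectorizes the inner loop depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* Sum every column of a row-major table (rows x cols) into sums[0..cols-1].
 * Hypothetical helper for illustration only.
 * The inner loop touches consecutive elements of one row, so each iteration
 * is independent of the others and a vectorizing compiler may update several
 * column sums with a single vector instruction. */
void column_sums(const double *restrict table, size_t rows, size_t cols,
                 double *restrict sums)
{
    assert(table != NULL && sums != NULL);

    for (size_t j = 0; j < cols; ++j)
        sums[j] = 0.0;

    for (size_t i = 0; i < rows; ++i) {
        const double *row = table + i * cols;   /* start of row i */
        for (size_t j = 0; j < cols; ++j)
            sums[j] += row[j];                  /* unit stride in both arrays */
    }
}
```

The restrict qualifiers state the assumption that the table and the result array do not overlap, which is usually what a compiler needs to know before it can safely reorder the additions.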
Note that because double elements are eight bytes wide and the vector loop processes two elements in each iteration, the upper bound and stride for the byte offsets into the arrays are 100 × 8 = 800 and 2 × 8 = 16, respectively.
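The offset arithmetic can be made concrete with a small C sketch. The loop being described is not reproduced in this text, so the element-wise addition below is an assumption made purely for illustration, as are the names add_by_byte_offset, ELEMS and PAIR_LANES. The function walks three arrays of 100 doubles by byte offset, handling two elements per step in the way a two-lane vector loop would.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ELEMS = 100, PAIR_LANES = 2 };   /* assumed trip count and lane count */

/* c[i] = a[i] + b[i] for ELEMS doubles, indexed by byte offset.
 * Hypothetical example: the addition stands in for whatever the loop does. */
void add_by_byte_offset(const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* With 8-byte doubles: upper = 100 * 8 = 800, stride = 2 * 8 = 16. */
    const size_t upper  = ELEMS * sizeof(double);
    const size_t stride = PAIR_LANES * sizeof(double);

    for (size_t off = 0; off < upper; off += stride) {
        double va[PAIR_LANES], vb[PAIR_LANES], vc[PAIR_LANES];

        /* Two-element loads, as a 16-byte vector load would perform. */
        memcpy(va, (const char *)a + off, stride);
        memcpy(vb, (const char *)b + off, stride);

        /* This short loop stands in for one packed two-lane add. */
        for (int k = 0; k < PAIR_LANES; ++k)
            vc[k] = va[k] + vb[k];

        /* Two-element store back to the result array. */
        memcpy((char *)c + off, vc, stride);
    }
}
```

Because 100 is a multiple of 2, the offset reaches exactly 800 after 50 steps and no elements are left over.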
For loops with a trip count that cannot be evenly divided by the vector length, a cleanup loop is used to execute any remaining iterations serially.
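The split between a main vector loop and a cleanup loop can be sketched in C as follows. This is an illustrative sketch under stated assumptions rather than the output of any particular compiler: the vector length of four elements (VLEN) and the function name scale_with_cleanup are hypothetical, and the inner fixed-length loop stands in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4   /* hypothetical vector length, in elements */

/* Multiply every element of x[0..n-1] by factor.
 * Hypothetical helper showing a main loop plus a cleanup loop. */
void scale_with_cleanup(float *x, size_t n, float factor)
{
    assert(x != NULL || n == 0);

    /* Largest multiple of VLEN that does not exceed n. */
    const size_t main_end = n - (n % VLEN);
    size_t i = 0;

    /* Main loop: VLEN elements per iteration. */
    for (; i < main_end; i += VLEN) {
        for (size_t k = 0; k < VLEN; ++k)   /* stands in for one vector op */
            x[i + k] *= factor;
    }

    /* Cleanup loop: the remaining n % VLEN elements, one at a time. */
    for (; i < n; ++i)
        x[i] *= factor;
}
```

With n = 10 and a vector length of 4, the main loop covers elements 0 through 7 in two iterations and the cleanup loop handles elements 8 and 9.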
Using the Short Vector Math Library (SVML) allows the compiler to proceed with vectorization of this loop as follows (an implementation that passes arguments and results in the XMM registers is planned as well).