September 6, 2001 | main menu | index of documents |
This paper presents a collection algorithms for real-time scaling of bitmapped graphics with adequate quality. High-performance algorithms are usually low quality, and high-quality algorithms are typically slow. The algorithms presented here are, when examined individually, each a compromise between quality, generality and performance. When viewed as a whole, however, they complement each other and create a high-performance, good-quality zooming engine.
Quick scaling is relevant in animation, where (movie) frames must be scaled from the encoded size to the display size many times per second, or where a large set of sprites must be scaled on the flight --each one to a different size, perhaps-- to reflect a different view point.
First, I will discuss a simple algorithm, dubbed "smooth Bresenham" that can best be described as nearest neighbour interpolation on a zoomed grid, using a Bresenham algorithm. The algorithm is quick, it produces a quality equivalent to that of linear interpolation and it can zoom up and down, but it is only suitable for a zoom factor that is within a fairly small range. To offset this, I next develop a directional interpolation algorithm that can only magnify (scale up) and only with a factor of 2×, but that does so in a way that keeps edges sharp. This directional interpolation method is quite a bit slower than the smooth Bresenham algorithm, and it is therefore practical to cache those 2× images, once computed. Caching images with relative sizes that are powers of 2, combined with simple interpolation, is actually a third image zooming technique: MIP-mapping.
In 1965 J.E. Bresenham published an algorithm to approximate polynomal functions on a grid of discrete values. In computer graphics, the polynomal functions typically refer to lines and elliptical curves, and the pixels of the display form the grid. Bresenham's algorithm is based on two concepts: 1) rewrite the equation so that it expresses the difference between one outcome and the next, and 2) keep the (accumulated) fractional part of the outcome in a separate variable --and update the integral part as soon as the fractional part exceeds unity.
For example, the general equation of a line is y=a·x+b. Now
if we fix (x_{0},y_{0}) at some arbitrary
location and decide that x_{i+1}=x_{i}+1,
so that we step one pixel horizontally at each iteration, then
y_{i+1}=y_{i}+a.
The value of a will typically have a fractional part, so we introduce
another symbol to keep the accumulation of these fractional parts, giving:
y_{i+1}=y_{i}+int(a)
D_{i+1}=D_{i}+frac(a)
with one additional (overflow) rule: if D exceeds 1.0, we increase
y by one and subtract 1.0 from D.
The use of floating point arithmetic can furthermore be avoided by using a scaled integer for frac(a). For example, if a=^{Dy}/_{Dx}, where Dx and Dy are integers (and assuming Dx>Dy), you could replace D by E=Dx·D and state E_{i+1}=E_{i}+Dy. The "overflow rule" must be scaled accordingly, of course.
These reformulations of polynomals are widely used to draw lines, circles and ellipses, but the same algorithm can be used in resampling (scaling an image up or down, rotating an image), or other areas where sampling on a discrete grid is needed. For example, scaling a horizontal scan line the Bresenham way can be done with the code snippet below.
Coarse scaling with Bresenham |
void ScaleLine(PIXEL *Target, PIXEL *Source, int SrcWidth, int TgtWidth) { int NumPixels = TgtWidth; int IntPart = SrcWidth / TgtWidth; int FractPart = SrcWidth % TgtWidth; int E = 0; while (NumPixels-- > 0) { *Target++ = *Source; Source += IntPart; E += FractPart; if (E >= TgtWidth) { E -= TgtWidth; Source++; } /* if */ } /* while */ } |
Note that when enlarging a picture, IntPart is always zero and when shrinking an image, IntPart is only non-zero when shrinking to less than half the original size. Also, if IntPart is zero, FractPart is just SrcWidth. These observations will be useful later when I propose to restrict the zoom range used with the Bresenham algorithm.
When zooming with Bresenham, pixels are picked up from discrete positions in the source image and placed on discrete positions in the destination image. This is known as nearest neighbour sampling; when magnifying, pixels from the source image are replicated, when minifying (scaling down), pixels from the source image are dropped. Both pixel replication and pixel dropping give artefacts in the destination image. For accurate resampling, pixels should be read from fractional positions in the source image and cover a fractional area --rather than just one pixel. In practice, the size of a pixel is just one pixel and the coordinates of a pixel are confined to integer values in both the source and the destination images; there is no pixel at coordinate pair (3.12, 6.41). To improve quality, many resampling techniques approximate the value of a source pixel at a fractional coordinate pair by taking a weighted average of a cluster of source pixels around that particular fractional coordinate pair. The diverse resampling algorithms differ mostly in how many pixels from the source image they consider and how they weigh these values.
As we have seen, the error accumulator in the Bresenham algorithm serves to adjust the source pointer on an overflow. In his article "Scaling Bitmaps with Bresenham", Tim Kientle observed that the error accumulator can, at the same time, function as a weighting criterion for interpolating a destination pixel between two neighbouring source pixels. Unfortunately, the algorithm requires, per colour channel, one multiplication per source pixel plus another multiplication and one division per destination pixel, and for a colour image, one needs three channels. I am focusing of multiplications and divisions, because these are still fairly slow CPU instructions -- when compared with addition or bit shifting. If your images are not already in 24-bit RGB format, the algorithm requires you to convert each pixel to RGB before calculating the weighted average and convert it back to the proper pixel format afterwards.
I am proposing a light weight alternative, which is a compromise between the linear interpolation (which is what Tim Kienztle's algorithm does) and the coarse Bresenham scaling. My algorithm sets each destination pixel to either the value of the closest pixel or the (unweighted) average of the two neighbouring pixels. The criterion whether to replicate or to average is based on the position of the destination pixel relative to the grid of the source pixels. If the destination pixel is on top of a source pixel, or close to one, that source pixel is replicated. If, on the other hand, the destination pixel is closer to the midpoint between two source pixels, those two source pixels are averaged.
Below is the modified scan line scaling algorithm. As is apparent, there are now two criterions for the accumulated error: when it exceeds the midpoint, it must start calculating the average between two neighbouring pixels; when it exceeds unity, the source position must be adjusted.
Smooth scaling with Bresenham |
#define AVERAGE(a, b) (PIXEL)( ((a) + (b)) >> 1 ) void ScaleLineAvg(PIXEL *Target, PIXEL *Source, int SrcWidth, int TgtWidth) { /* N.B. because of several simplifications of the algorithm, * the zoom range is restricted between 0.5 and 2. That * is: TgtWidth must be >= SrcWidth/2 and <= 2*SrcWidth. */ int NumPixels = TgtWidth; int Mid = TgtWidth / 2; int E = 0; PIXEL p; if (TgtWidth > SrcWidth) NumPixels--; while (NumPixels-- > 0) { p = *Source; if (E >= Mid) p = AVERAGE(p, *(Source+1)); *Target++ = p; E += SrcWidth; if (E >= TgtWidth) { E -= TgtWidth; Source++; } /* if */ } /* while */ if (TgtWidth > SrcWidth) *Target = *Source; } |
Another change, whose purpose may not be immediately obvious, is that the loop stops one pixel before the final destination pixel is reached if the scan line is enlarged. When processing the last destination pixel on a scan line, the source pointer arrives at the last pixel in the source scan line. If, at that point, the algorithm decides to average pixels, it will attempt to average the last source pixel with the pixel one beyond the last. To avoid this error, the algorithm always replicates the last pixel on a scan line. Note that this works for enlarging up to 2×; for a larger zoom factor, you need to end the loop earlier still. Of course, one could check for an overflow of the source pointer inside the loop, but in the interest of performance, I have kept a maximum of code outside the inner loop. Later in this paper, I will show you that I do not intent to call this function with a zoom factor larger than 1.5, or one below 0.75.
A restricted zoom range simplifies the code. As mentioned earlier, the IntPart and FractPart variables (from the code snippet for coarse Bresenham scaling) are not needed when the zoom factor is known to be higher than 0.5. So in the smooth Bresenham scaling routine, I removed them. It is perfectly well possible to create a smooth Bresenham scaling routine that does not have any limits on the scaling factor; the coarse Bresenham scaling routine shows you what modification this requires.
In contrast to the coarse Bresenham scaling routine, averaging Bresenham scaling needs to know the pixel format of the data that it zooms. This is hidden in the AVERAGE macro; the code snippet assumes 8-bit gray scale pixels. In the AVERAGE macro, I have used a "shift-right" operator rather than a division by two to make it explicit that the algorithm requires only addition, subtraction and bit shifting. In real production code, I would simply write a division by two and trust that the compiler would optimize that particular integer division to a bit shift. Or actually, I would optimize the routine in assembler myself; being old-fashioned and stuborn, I still think that my assembler coding skills can beat a compiler's cunning algorithms and, unfortunately, practice has proven me right on all but a few occaisions.
The merit of the proposed algorithm is that a standard (unweighted) average is much simpler to calculate than a weighted average, not only for gray-scale but especially so for other pixel formats. Simplicity translates to "increased performance", here. The section "Averaging pixel values" covers the (approximate) calculation for averages for several common pixels formats.
As an aside, another way of seeing this algorithm is that I put imaginary source pixels between each pair of actual source pixels, and the colours of these imaginary pixels are averaged from the neighbouring actual pixels. This creates a virtual grid with twice the resolution, and I am applying the coarse Bresenham scaling algorithm on this new virtual grid.
Original picture |
|||
75% - Smooth Bresenham |
75% - Nearest Neighbour |
75% - Bilinear interpolation |
75% - Bicubic interpolation |
150% - Smooth Bresenham |
150% - Nearest Neighbour |
150% - Bilinear interpolation |
150% - Bicubic interpolation |
Original picture |
|||
75% - Smooth Bresenham |
75% - Nearest Neighbour |
75% - Bilinear interpolation |
75% - Bicubic interpolation |
150% - Smooth Bresenham |
150% - Nearest Neighbour |
150% - Bilinear interpolation |
150% - Bicubic interpolation |
The example images show that "smooth Bresenham" is not as smooth as bilinear or bicubic interpolation, but it is far better than coarse Bresenham scaling, while costing only little more than coarse Bresenham scaling. In particular, the most objectable properties of nearest neighbour interpolation, "dropped" pixels and excessive "jaggies" due to pixel splatting, are absent from smooth Bresenham scaling --provided that the zoom range stays within the range ^{2}/_{3} to 2. While the bilinearly interpolated image is smoother, note that it is actually blurred. In some areas of the example pictures, smooth Bresenham looks better than bilinear interpolation.
The result of bicubic interpolation, by the way, depends on the particular polynomal that is used. The linear interpolation and cubic interpolation examples show here were produced with Adobe Photoshop 5.5. Adobe Photoshop's bicubic interpolation filter has a sharpening effect. This may augment the perceived quality of the interpolation process, but creates artefacts at (steep) edges, as the example below shows. To emphasize the effect, I the bottom row shows two representative cut-outs of the pictures zoomed by 400%.
Original picture |
102% - Bicubic interpolation |
Original picture, zoomed by 400% |
102% - Bicubic interpolation, zoomed by 400% |
For small scaling factors, the interpolating filters cause the output picture to be blurred. This is apparent in the above figure, nonwithstanding the implied sharpening of the bicubic interpolation polynomal used. For these small scaling factors, one can reduce the blurring effect by "snapping" the target pixel to the nearest source pixel if it is "close enough", and interpolating it otherwise. A blend between nearest neighbour and interpolation, so to speak. With the smooth Bresenham algorithm, you achieve this by altering the midpoint criterion. In the ScaleLineAvg() function, the local variable Mid is set to TgtWidth / 2; increasing its value makes the target pixels snap earlier to source pixels (when Mid exceeds the value of TgtWidth, smooth Bresenham reduces to nearest neighbour).
The routines presented so far scale a single scan line horizontally. The routines are trivially extended to scale column-wise. Then, by first looping over all raster lines and consecutively looping over all raster columns of the image, you can scale it both horizontally and vertically.
A better solution, from the standpoint of performance, is to combine horizontal and vertical scaling in a single pass. To that end, the vertical scaling routine processes complete scan lines and it calls the horizontal scaling routine to scale one scan line. The "coarse" 2D scaling routine printed below, manifests a great similarity with the horizontal scan line scaling routine. I have made one optimzation: if the source pointer does not change in an iteration, rather than scale the same source scan line again, I just copy the result of the previous iteration.
Coarse 2D scaling |
void ScaleRect(PIXEL *Target, PIXEL *Source, int SrcWidth, int SrcHeight, int TgtWidth, int TgtHeight) { int NumPixels = TgtHeight; int IntPart = (SrcHeight / TgtHeight) * SrcWidth; int FractPart = SrcHeight % TgtHeight; int E = 0; PIXEL *PrevSource = NULL; while (NumPixels-- > 0) { if (Source == PrevSource) { memcpy(Target, Target-TgtWidth, TgtWidth*sizeof(*Target)); } else { ScaleLine(Target, Source, SrcWidth, TgtWidth); PrevSource = Source; } /* if */ Target += TgtWidth; Source += IntPart; E += FractPart; if (E >= TgtHeight) { E -= TgtHeight; Source += SrcWidth; } /* if */ } /* while */ } |
Smooth scaling in two dimensions is a bit more involved, because you must also average two scan lines. The smooth Bresenham scaling needs a "read ahead" of one --one pixel in the horizontal case and one smoothly scaled scan line in the vertical case. Since we also want to avoid scaling the same scan line twice, it is easiest to use a second temporary scan line buffer. In fact, the routine is getting much more convenient to create if you use two of those temporary buffers.
The smooth 2D scaling algorithm is below. Keep two things in mind when reading through this code snippet: 1) the routine calls malloc for two buffers but lacks any error checking. Use any way to allocate memory and any way to handle errors that suits you. 2) in general, I care less about optimizing this routine than I did for the scan line scaling routine because the latter routine contains the "inner loop" (performance bottlenecks are typically in inner loops). The exception is the for loop that averages two scan lines. That for loop is also an inner loop, so, in production code, I would keep an eye on it and check whether it benefits from hand optimization.
The special cases for the two scan line caches make the code fairly long and tricky. If you need to be able to scale multiple pixel formats, it is probably a good idea to create optimized versions for the smooth scan line scaling routines (one for each pixel format) and a single universal smooth vertical scaling function. The vertical scaling function then invokes the appropriate scan line scaling function through function pointers, polymorphism, or whatever other technique you have at your disposition.
In the routines presented here, one horizontal scan line from the source image is mapped to a horizontal scan line in the (scaled) destination image. The idea is more generally applicable, however: you could walk over the source image along a skewed line and write the output pixels on a horizontal scan line in the destintion. This results in combined rotation and scaling. The only new issue to solve is: how to walk along a skewed line over a discrete grid formed by the pixels of the source image. If you have followed the article this far, you know how to solve that: Bresenham.
The quality of the smooth Bresenham scaling algorithm quickly deteriorates when the zoom factor drops below ^{2}/_{3} or raises beyond 2. Within that range, every source pixel is represented, fully or partially, in the destination image and no source pixel is replicated twice in exactly the same way (a source pixel may attribute to two destination pixels, but with different weights). Outside that range, some pixels will get dropped or replicated, and both these phenomena are quite noticeable.
There is another well-known technique for scaling images with a better quality than nearest neighbour and that has low complexity, which goes by the name "MIP-mapping". Originally a MIP-map is a one-channel image split in four equally-sized quadrants. Three of the quadrants each hold one channel of the original colour image; the fourth quadrant is a copy of the first three quadrants scaled by a factor of ^{1}/_{2}. This process is applied recursively: the fourth quadrant of the sub-image is a sub-sub-image of the first three quadrants scaled by ^{1}/_{4}. MIP stands for "Multum In Parvo", freely translated as "many in little". It means that one bitmap holds many images, whose sizes are powers of two, while the total amount of memory space needed to keep these "many images" is just ^{4}/_{3} times that of the largest image. In the original MIP-mapping algorithm as published by Lance Williams, one obtained an output image with the requested size by first finding the section in the MIP-map with the nearest size and then interpolate from that.
Diverse implementations have deviated from the original MIP-mapping algorithms and bitmap lay-outs. That is why I prefer to use a looser definition of the concept: a MIP-map is a multi-resolution image where the relative resolutions are powers of two. How the multi-resolution image is implemented --whether it is a single bitmap with the colour channels readily split into separate quadrants or a list of separate bitmaps for the same scene at different scales-- doesn't matter in this context. The MIP-map is a basis for interpolation. Several computer games and multi-media titles use MIP-mapping in combination with a "nearest neighbour" interpolator, like the coarse Bresenham scaler, to scale the appropriate MIP-map section to the final size. I suggest that one uses the smooth Bresenham scaler instead: it offers higher quality at a low performance cost.
Using MIP-mapping in combination with smooth Bresenham circumvents the limitation that the smooth Bresenham algorithm is only suitable for a fairly small zoom range. When you can choose between image resolutions where each section has twice the size of its predecessor, the zoom factor that the final interpolator must handle lies between ^{3}/_{4} and ^{3}/_{2}. This is neatly inside the range of smooth Bresenham.
The MIP-mapping scheme assumes that the pre-scaled bitmaps to select amongst are already available. I assume that they might not be and that they need to be derived from a source image in real time. The algorithms to create large magnifications or minifications should therefore still run quickly, if they are going to be used in animation. If, in your application, it is likely that an image at a certain scale may be needed again in some near future, I suggest that you cache the 2× magnified and minified bitmaps after they have first been calculated. By doing this, you fill in the diverse resolutions of the MIP-map at run time on an "as needed" basis.
One of the simplest ways to halve the resolution of a picture is to replace each quad of four pixels by one pixel with is the average of the four pixels. Sampling theory calls this a "box filter", it is akin to linear interpolation. In my experience, the box filter gives good quality when minifying an image, but not everyone agrees with this (see for example Alvy Smith's memo).
When the average of four pixels is easy to calculate, by all means do so. If it is harder to calculate, such as with palette indexed images, you may opt to use the AVERAGE macro three times in succession, as is done in the code snippet below.
Scale ^{1}/_{2}×, box filtered |
void ScaleMinifyByTwo(PIXEL *Target, PIXEL *Source, int SrcWidth, int SrcHeight) { int x, y, x2, y2; int TgtWidth, TgtHeight; PIXEL p, q; TgtWidth = SrcWidth / 2; TgtHeight = SrcHeight / 2; for (y = 0; y < TgtHeight; y++) { y2 = 2 * y; for (x = 0; x < TgtWidth; x++) { x2 = 2 * x; p = AVERAGE(Source[y2*SrcWidth + x2], Source[y2*SrcWidth + x2 + 1]); q = AVERAGE(Source[(y2+1)*SrcWidth + x2], Source[(y2+1)*SrcWidth + x2 + 1]); Target[y*TgtWidth + x] = AVERAGE(p, q); } /* for */ } /* for */ } |
I have written the above code for clarity more than to serve as production code. One obvious optimization is to avoid calculating the same values over and over again: the expressions y2*SrcWidth, (y2+1)*SrcWidth and y*TgtWidth remain constant in the inner loop and should be substituted by pre-computed variables. Perhaps you can trust your compiler to do this optimization for you.
Original picture |
50% - Box filter |
50% - Bilinear interpolation |
50% - Bicubic interpolation |
Original picture |
50% - Box filter |
50% - Bilinear interpolation |
50% - Bicubic interpolation |
From the example images shown above, it is apparent that the box filter is equivalent to linear interpolation with this particular scale factor. The box filter (and the bilinear interpolator) perform better than the bicubic interpolator when minifying: the resulting images of the bicubic interpolator are blurred.
Among the more popular algorithms for enlarging a bitmapped image are pixel replication and bilinear interpolation; these algorithms are quick and simple to implement. The pixel replication algorithm simply copies the nearest pixel (nearest neighbour sampling). As a result, the algorithm is prone to blockiness, which is especially visible along the edges. Bilinear interpolation is at the other extreme, it takes the (linearly) weighted average of the four nearest pixels around the destination pixel. This causes gradients to be fluently interpolated, and edges to be blurred. Several other algorithms use different weightings and/or more pixels to derive a destination pixel from a surrounding set of source pixels.
The reasoning behind these algorithms is that the value of a destination pixel must be approximated from the values of the surrounding source pixels. But you can do better than just to take the distances to the surrounding pixels and a set of fixed weight factors stored in a matrix. Directional interpolation is a method that tries to interpolate along edges in a picture, not across them.
I will disclose my successful approach in a moment. First, I wish to spend a few words on earlier unsuccessful experiments.
The directional interpolation algorithm requires that you find the edges first, or rather, that you determine the direction of the gradient at every pixel. My first experiment was to find the gradient at every source pixel using a simple edge detector, which is a pair of Roberts masks. In the figure at the right, the Roberts masks are put in 3×3 matrices, but, as you can see, calculating the value of one mask involves just one subtraction. Either mask detects an edge in a diagonal direction. From the two values, you can calculate both the direction of the gradient and the "steepness" of the gradient.
Given the direction of the gradient, I chose between five interpolators: horizontal, vertical left-slanted, right-slanted and omni-directional. The omni-directional interpolator is a bilinear interpolator, it is used when the steepness of the gradient --in whatever direction-- is low. The other interpolators are just linear (not bilinear), but each chooses different source pixels to interpolate between.
Roberts masks are known to be sensitive to noise: they tend to detect an edge where there is no real edge. Since my goal is to simply choose between interpolators, I figured that choosing a directional interpolator on a pixel where there is no edge, would not matter much. I had not foreseen, however, that the sensitivity of Roberts masks might also result in returning the wrong direction of the gradient in the presence of an edge. In addition, it is not just sensitivity to noise that hampers Roberts masks, the number of points considered by a pair of Roberts masks is just too small to give a sensible indication of the gradient's direction and magnitude.
This, I tried to fix in a second experiment, using Sobel masks. Sobel masks are generally considered to be robust while still being simple. They consider three times as many source pixels, so their estimate of the gradient direction should be significantly more accurate. To my surprise, however, I found that Sobel masks would completely miss a horizontal or vertical line with a width of one pixel that runs through the middle of the masks. A one-pixel wide diagonal line on an angle of 45° or 135° is also missed. Both in my line graphic test image as in the Lena picture, this proved to cause severe "noise" due to wrongly determined gradient directions. In further experiments, I tried more simple edge detectors. Although it was not hard to improve on Sobel, the results were still not good enough.
The final approach that I have taken is to determine a (partial) gradient for each target pixel. Since the routine magnifies by 2, the algorithm must map each source pixel to four target pixels. The preceding experiments established the gradient for one source pixel and used that for four target pixels. The new approach calculates four gradients for each source pixel. The figure at the right represents nine pixels from the source image. To calculate the target pixel for the upper left quadrant for the middle source pixel, the algorithm considers the source pixels 0, 1, 3 and 4.
The algorithm is conceptually simple: of all source pixels under consideration, it interpolates between the pair of pixels with the lowest difference. The upper left target pixel may be:
The first step is to calculate the differences between the pairs of source pixels 4-2, 4-3, 4-0 and 3-1. Actually, I am not interested in the sign of the differences, just the distances. The algorithm determines the minimum distance and chooses the appropriate interpolator. If the minimum distance exceeds some limit, the algorithm decides to replicate the source pixel, rather than interpolate in any direction. This is intended to keep sharp horizontal and vertical edges sharp. Interpolation, in this particular context, just means (unweighted) averaging; so in the case of interpolator 1, the colour of target pixel is the halfway between of source pixels 4 and 1.
In the original code, the fourth interpolator just set the target pixel to the average of source pixels 1 and 3. In testing, I found that I obtained better results by interpolating with source pixel 4 afterwards too. Due to the two consecutive averaging operations, the target pixel's colour is ultimately set to ½S_{4} + ¼S_{1} + ¼S_{3} (where S_{i} stands for "source pixel i").
Below is a snippet that computes the "upper left" target pixel for a source pixel at coordinate (x, y). Using a similar procedure, one obtains the other three target pixels for this source pixel.
Scale 2× with directional interpolation |
#define ABS(a) ( (a) >= 0 ? (a) : -(a) ) #define PIXELDIST(a, b) ABS( (a) - (b) ) p = Source[y*SrcWidth + x]; /* target pixel in North-West quadrant */ p1 = Source[(y-1)*SrcWidth + x]; /* neighbour to the North */ p2 = Source[y*SrcWidth + (x-1)]; /* neighbour to the West */ p3 = Source[(y-1)*SrcWidth + (x-1)]; /* neighbour to the North-West */ d1 = PIXELDIST(p, p1); d2 = PIXELDIST(p, p2); d3 = PIXELDIST(p, p3); d4 = PIXELDIST(p1, p2); /* North to West */ /* find minimum */ min = d1; if (min > d2) min = d2; if (min > d3) min = d3; if (min > d4) min = d4; /* choose interpolator */ y2 = 2 * y; x2 = 2 * x; if (min > BOUNDARY) Target[y2*TgtWidth+x2] = p; /* do not interpolate */ else if (min == d1) Target[y2*TgtWidth+x2] = AVERAGE(p, p1); /* North */ else if (min == d2) Target[y2*TgtWidth+x2] = AVERAGE(p, p2); /* West */ else if (min == d3) Target[y2*TgtWidth+x2] = AVERAGE(p, p3); /* North-West */ else /* min == d4 */ Target[y2*TgtWidth+x2] = AVERAGE3(p, p1, p2); /* North to West */ |
Below, the directional interpolating algorithm is compared to bilinear and bicubic interpolation. Directional interpolation achieves sharper results than both bilinear and bicubic interpolation. Note that the original ``line graphics'' picture was intentionally created without anti-aliasing. It is to be expected that the output contains "jaggies" too.
Original picture |
200% - Directional interpolation |
200% - Bilinear interpolation |
200% - Bicubic interpolation |
Original picture |
200% - Directional interpolation |
200% - Bilinear interpolation |
200% - Bicubic interpolation |
For this algorithm to be feasible, you must have a way to determine the distance between two pixels. In the code snippet, this is hidden in the PIXELDIST macro. This "pixel distance" calculation is also likely to become the performance bottleneck of the application, which is why I will give it a little more attention.
For gray scale pixels, calculating the pixel distance is as simple as taking the absolute value of the subtraction of a pair of pixels. The SIMD instructions ("single instructions, multiple data", previously known as MMX) of the current Intel Pentium processors allow you to obtain absolute values of four signed 16-bit values in parallel, using only a few instructions. For example code, see Intel's application note AP-563.
For 24-bit true colour images, you have several options to calculate the pixel distance:
I have a slight preference for the lastly listed method: interleaving. It is actually a simple trick where an RGB colour that stored in three bytes with an encoding like:
R_{7}R_{6}R_{5}R_{4}R_{3}R_{2}R_{1}R_{0} - G_{7}G_{6}G_{5}G_{4}G_{3}G_{2}G_{1}G_{0} - B_{7}B_{6}B_{5}B_{4}B_{3}B_{2}B_{1}B_{0}is transformed to:
G_{7}R_{7}B_{7}G_{6}R_{6}B_{6}G_{5}R_{5} - B_{5}G_{4}R_{4}B_{4}G_{3}R_{3}B_{3}G_{2} - R_{2}B_{2}G_{1}R_{1}B_{1}G_{0}R_{0}B_{0}
The new encoding groups bits with the same significance together and the subtraction between two of these interleaved values is thus a good measure of the distance between the colours. To calculate the interleaved encoding quickly, Tim Kientzle proposes to use a precalculated lookup table that maps an input value in the encoding:
A_{7}A_{6}A_{5}A_{4}A_{3}A_{2}A_{1}A_{0}to:
0 0 A_{7}0 0 A_{6}0 0 - A_{5}0 0 A_{4}0 0 A_{3}0 - 0 A_{2}0 0 A_{1}0 0 A_{0}With this table, you need only three table lookup's (for the red, green and blue channels), two bit shifts and two bitwise "OR" instructions to do the full conversion. That is:
Calculate interleaved value for a colour |
typedef struct __s_RGB { unsigned char r, g, b; } RGB; unsigned long InterleavedValue(RGB colour) { static unsigned long InterleaveTable[256] = { 0x00000000, 0x00000001, 0x00000008, 0x00000009, 0x00000040, 0x00000041, 0x00000048, 0x00000049, 0x00000200, 0x00000201, 0x00000208, 0x00000209, 0x00000240, 0x00000241, 0x00000248, 0x00000249, /* etc. etc. */ }; return (InterleaveTable[colour.g] << 2) | (InterleaveTable[colour.r] << 1) | InterleaveTable[colour.b]; } |
I suggest that you also use bit interleaving to calculate pixel distances between 16-bit HiColor colours. If you use two different pre-calculated tables, you can compute the interleaved colour with only two lookup's and one bitwise "OR". Alternatively, you could create a 64 kBytes table, but, due to CPU memory cache issues, this may not be quicker.
For 256-colour palette-indexed pixels, I also suggest that you create a lookup table that converts a palette index to an interleaved RGB colour and to proceed from that.
As mentioned earlier, the pixel distance algorithm is likely to become the limiting factor in the routine's performance. One implementation-specific optimization trick was already mentioned: exploit parallelism by using SIMD instructions. There are also a couple of algorithmic optimizations that help. In the code snippet for the upper left target pixel, we can count that we need to calculate four pixel distances. For four target pixels per source pixel, that amounts to 16 pixel distances. The first opportunity for optimization is to recognize that there is overlap in the neighbouring pixels used for each of the target pixels. When grouping the calculations of all pixel distances for one source pixel (four target pixels), one sees that there are 12 unique pixel distances that must be calculated --already a 25% savings from the original 16.
Additionally, there is an overlap in the pixel distances that are needed for the current source pixel and those for the next source pixel. Every source pixel uses five pixel distances that are shared with those for the previous pixel. This brings the number of pixel distances to be calculated down to 7 per source pixel. In the figure at the right, the blue and red arrows indicate the distances for the current pixel and the red and green arrows those for the next pixel. The five red arrows are thus the pixel distance calculations that are shared between the current and next pixels.
A final comment to the code snippet is on the sequence of if statements to find the minimum value. At first I wanted to use the CMOV instruction of the Pentium-II CPU to calculate the minimum value without jumps. An application note by Intel showed that one can do better, by, again, a creative use of SIMD instructions.
Throughout the descriptions of the scaling algorithms presented here, there was an implicit assumption that calculation of an unweighted average of two colours is easy to calculate, regardless of the pixel format. This section presents (approximate) calculation methods for a few pixel formats.
Gray scale pixels are the easiest. The code snippet below averages two gray scale pixels that are passed in as parameters a and b.
Averaging gray scale pixels |
typedef unsigned char PIXEL; #define AVERAGE(a, b) (PIXEL)( ((a) + (b)) >> 1 ) |
24-bit RGB pixels can be throught of as three independent channels, where you process each channel individually. With this thought, you would call the gray scale averaging macro on the red, green and blue channels of a 24-bit colour pixel individually. There is a quicker --though less accurate-- way, however.
Assuming that the three RGB channels are packed in one 32-bit integer where the fourth byte of the integer is zeroed out, you can take the average of two of such integers in a single step, provided that you attent to overflows of individual channels. For example, the lay-out for a packed pixel is:
I have named the fourth (zeroed out) channel A (for "auxiliary"). When adding two of these packed pixels, two B_{7} bits can cause a "carry" that is added to the G_{0} bits. We want to avoid that the addition of the two blue channels influences the green channel, and one way to do that is by making sure that the G_{0} bits do not hold relevant data. That said, a quick way to make the G_{0} bits suitable "drop buckets" for the carry bit of the sum of two B_{0} bits, is to zero the G_{0} bits out before the summation. In code:A_{7}A_{6}A_{5}A_{4}A_{3}A_{2}A_{1}A_{0} - R_{7}R_{6}R_{5}R_{4}R_{3}R_{2}R_{1}R_{0} - G_{7}G_{6}G_{5}G_{4}G_{3}G_{2}G_{1}G_{0} - B_{7}B_{6}B_{5}B_{4}B_{3}B_{2}B_{1}B_{0}
Averaging 24-bit RGB pixels |
typedef unsigned long PIXEL; #define AVERAGE(a, b) ( (((a) & 0xfefefeffUL) + ((b) & 0xfefefeffUL)) >> 1 ) |
The AVERAGE macro first zeros the G_{0}, R_{0} and A_{0} bits; then adds the two packed values and divides by two. There is no need to zero the B_{0} bits; I did zero the A_{0} bits because there is no harm in doing so, but that bit will likely already be zero.
Obviously, we loose one bit of precision (for the green and red channels) when using this trick. By throwing away the lowest bit, we are reducing the colour resolution of the red and green channels from 8 to 7 bits. However, this is invisible in most applications.
HiColor modes with a bit resolution of 15 or 16-bits are quite popular; they give a nice compromise between number of colours, low memory (and memory bandwidth) requirements for bitmapped graphics and ease of use. To average two 16-bit HiColor pixels, you can basically use the same trick as the one for 24-bit pixels. The loss of one bit of colour depth per channel cannot be done away with as easily, however, as the colour depth for the red channel is only 5 bits. Reduction to 4 bits might be quite visible.
The situation where the loss of colour precision is especially objectable is when the two input pixels are the same: when a and b are equal, one expects the average to be equal to a (and b); due to the lowest bit being thrown away for the green and red channels, this may no longer be so. This particular case is easy to fix, however:
Averaging 16-bit HiColor pixels (565 format) |
typedef unsigned short PIXEL; #define AVERAGE(a, b) (PIXEL)( (a) == (b) ? (a) \ : (((a) & 0xf7dfU) + ((b) & 0xf7dfU)) >> 1 ) |
Another refinement is to recognize that the two most significant "blue" bits --the B_{4} bits for HiColor pixels-- can never overflow into the least significant "green" bit (G_{0}) of the result when they are both zero. That is, if the two B_{4} are zero, there is no need to clear the G_{0} bits before the addition of the two HiColor values. Creating a better mask involves just a bit of extra code:
Improved averaging 16-bit HiColor pixels (565 format) |
typedef unsigned short PIXEL; static PIXEL mask; #define GETMASK(a, b) (PIXEL)( ~((((a) | (b)) & 0x0410) << 1) ) #define AVERAGE(a, b) (PIXEL)( (a) == (b) ? (a) \ : ( mask = GETMASK(a, b), (((a) & mask) + ((b) & mask)) >> 1) ) |
The idea behind these macros is that we retrieve the B_{4} and the G_{5} bits of the combination of both HiColor pixels ("combination" meaning "bitwise or" here). All other bits are not interesting and these are simply masked of. Then the resulting B_{4} bit is shifted into the position of G_{0}, and G_{5} moves to position R_{0} at the same time. If both original B_{4} bits were zero, there is now a 0 at the G_{0} position; if either original B_{4} was one, there is now a 1 at G_{0}, and likewise for G_{5} versus R_{0}. Inverting all bits now results in the appropriate mask for the subsequent AVERAGE macro.
Averaging pixels for 256-colour, palette indexed images is harder. The quickest averaging method that I have come up with is a table lookup for each pair of possible palette indices. As the lookup table must be precomputed, which is a lengthy process, this is only suitable to scale multiple images based on a single palette. Fortunately, this is mostly the case.
The lookup table is 64 kBytes in size: 256 rows × 256 columns where each element is the palette entry that best approximates the average colour of the two palette entries that select the row and column. Due to symmetry in the lookup table, it suffices to calculate 33,024 (half of 64 k, plus 256 for the diagonal) average colours.
The lookup table is too large for the primary cache (L1 cache) on the CPU, and therefore the bottleneck in this routine may well be the memory fetch subsystem on the CPU.
Averaging palette-indexed pixels |
typedef unsigned char PIXEL; PIXEL average_table[65536]; /* precomputed */ #define AVERAGE(a, b) average_table[((a) << 8) + (b)]; typedef __s_RGB { unsigned char r, g, b; } RGB; void MakeAverageTable(PIXEL *table, RGB *palette) { int x, y; RGB m; for (y = 0; y < 256; y++) { for (x = y; x < 256; x++) { m.r = (unsigned char)( ((int)palette[x].r + (int)palette[y].r) / 2); m.g = (unsigned char)( ((int)palette[x].g + (int)palette[y].g) / 2); m.b = (unsigned char)( ((int)palette[x].b + (int)palette[y].b) / 2); table[(y << 8) + x] = table[(x << 8) + y] = PaletteLookup(palette, m); } /* for */ } /* for */ } |
The critical function inside MakeAverageTable that is not discussed here, is PaletteLookup. This function should return the palette index that comes closest to the given colour. Diverse methods exist to quantize a colour to a palette, for example, bit interleaving for RGB colours makes it suitable to sort a palette on the interleaved values and to find a closest match by binary search. An alternative is to create a 3-dimensional lookup table in an intelligent way; see for example the article by Leonardo Zayas Vila.
The paper started with the description of a new algorithm that is low cost (meaning "very fast") and that significantly improves on nearest neighbour interpolation. The drawback is a fairly low range of applicable zoom factors.
Two other algorithms were subsequently presented to obtain both very small and very large zoom factors, in steps to 2×. The reduction routine is a simple quad-pixel averaging algorithm (box filtering). The enlargement routine is a novel algorithm; it achieves good quality, but its performance is not spectacular. To that end the resulting images of these 2× scaling routines are cached for future use.
A large-range scaling algorithm can then be formed by a combination of the three scaling algorithms, using a MIP-mapping scheme.
Throughout the paper, the code snippets and the descriptions kept focusing on performance. Implementation details were exhausted on where they were deemed useful. Actual measurements, though, are missing: benchmarks are dependent on the hardware on which they run and on the implementation of the software. As the code snippets presented in this article are not optimized for performance and as my computer may not be representative for the typical machine of your application's target audience, I have left the evaluation and optimization up to you.
© Thiadmer Riemersma, ITB CompuPhase, 2001, The Netherlands
http://www.compuphase.com