Managing convnet output on teravoxel images often involves producing summary images that can more easily be downloaded, displayed, and processed on inexpensive hardware. Common methods for downsampling ordinary photographs or microscope images work by defining a window on the image and then applying filters like averaging or lanczos3 (sinc) to summarize the contents of the window into a smaller set of pixels. These blending methods are unsuitable for segmentation labels. The downsampling of a set of segmentation labels must contain actual pixel values from the input image, as the labels are categorical and blending them is nonsensical. If pixels labeled 1 refer to cars and pixels labeled 3 refer to birds, then the average of the two, 2 referring to people, is not a faithful representation of the underlying image.
Thus, downsampling categorical labels consists of defining windows on an image and selecting an exemplar from each block. A common method is to choose the exemplar by picking among the most frequent pixels in a block, also known as finding the mode. The most obvious means of accomplishing this involves counting the frequency of each label, which is easily done in a high performance language like C. However, Python loops are very slow, which makes this method untenable without the use of C extensions (Cython) that can make a project more cumbersome to maintain and require specialized knowledge. Here I present COUNTLESS, a method that computes the mode of four unsigned integers without counting, along with a Numpy implementation useful for generating 2x downsamples of labeled images.
COUNTLESS Algorithm
The simplest 2D downsampling problem to solve is the four pixel 2×2 image. A 2×2 image can be summarized by its single most frequent pixel to achieve a 2x reduction on each side. Sticking to even sized images for now, a larger image can be divided into 2×2 blocks, and if this process is repeated independently across each block, this will result in an overall 2x reduction of the whole image.
The 2×2 image presents the five cases listed in Figure 1. In 1(a), 1(c), and 1(e), all the pixels are in the most frequent class and are thus valid solutions. 1(b) and 1(d) require a more sophisticated approach. Notably, in all five cases, a randomly chosen pixel is more likely than not to be in the majority, which shows why striding can be an acceptable, though not perfect, approach. The key insight that forms the foundation of the algorithm is that in all cases, if two pixels match, they are in the majority.

In the following text, capital letters A,B,C,D refer to a pixel location’s non-zero value. We define the comparison operator PICK(A,B) that generates either a real pixel value or zero.
PICK(A,B) := A if A == B else 0 EQN. 1
In frameworks like Python/numpy where True and False are represented as one and zero respectively, this can be implemented numerically as:
PICK(A,B) := A * (A == B) EQN. 2
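As a quick sanity check, EQN. 2 works element-wise on numpy arrays as well as on scalars (a minimal sketch; the array values are illustrative):

```python
import numpy as np

def pick(a, b):
    """PICK(A,B) per EQN. 2: A where A == B, zero elsewhere."""
    return a * (a == b)

a = np.array([5, 7, 0], dtype=np.uint8)
b = np.array([5, 2, 0], dtype=np.uint8)
print(pick(a, b))  # [5 0 0] -- note the matching zeros also yield 0,
                   # a wrinkle that is dealt with later in the article
```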
Next, let’s consider the various interactions between A, B, & C. In the following table, the notation AB means PICK(A,B) and lower case letters refer to a particular value, so a repeated letter ‘a’ in two columns means for example that both pixels are red.
Pixel          PICK(X,Y)
A B C     AB   BC   AC
a a a  => a    a    a
a b a  => 0    0    a
a a b  => a    0    0
b a a  => 0    a    0
a b c  => 0    0    0   <-- only fully zeroed row
TABLE 1. PICK(X,Y) (denoted XY) interactions between A, B, and C
Table 1 shows that only the majority pixel among A, B, and C, or zero, will appear as the result of the PICK operations. In the case that A, B, and C are all different, all PICKs return zero. This makes the problem of pixel selection amenable to a simple logical test. A, B, and C all being different corresponds to either case 1(b) or 1(e) in Figure 1, with D in the majority in the 1(b) case. If the case is 1(b), D is an acceptable solution. If the case is 1(e), there is no majority pixel and D is again an acceptable solution.
Therefore, when A, B, or C match, choose the match, and when none of them do, choose D. This can be expressed in a computer language with a short circuiting logical OR (||) as:
MODE(A,B,C,D) := PICK(A,B) || PICK(B,C) || PICK(A,C) || D EQN. 3
We can implement logical OR numerically as:
LOGICAL_OR(X,Y) := X + (X == 0) * Y EQN. 4
EQN. 3 and EQN. 4 will handle all unsigned integer values correctly except for zero. Therefore, in cases where zero is a valid pixel, we can add one to the image at the beginning of the algorithm and subtract one before returning the result.
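Putting EQN. 3 and EQN. 4 together on scalars, with the +1/-1 offset for zero labels, looks like this (a sketch; `mode_2x2` is a hypothetical helper name):

```python
import numpy as np

def logical_or(x, y):
    """LOGICAL_OR(X,Y) per EQN. 4: X if X is nonzero, else Y."""
    return x + (x == 0) * y

def mode_2x2(a, b, c, d):
    """MODE(A,B,C,D) per EQN. 3, offset by +1 so zero labels survive."""
    # cast up so the +1 can't overflow the original dtype's maximum value
    a, b, c, d = [np.uint16(x) + 1 for x in (a, b, c, d)]
    ab = a * (a == b)  # PICK(A,B)
    bc = b * (b == c)  # PICK(B,C)
    ac = a * (a == c)  # PICK(A,C)
    return logical_or(logical_or(logical_or(ab, bc), ac), d) - 1

print(mode_2x2(1, 1, 3, 4))  # 1 (two pixels match, so the match wins)
print(mode_2x2(1, 2, 3, 4))  # 4 (all differ, so D wins)
print(mode_2x2(0, 0, 3, 4))  # 0 (zero labels handled via the offset)
```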
The 2×2 approach can be easily extended to cover any even dimensioned image. Simply cover the image with non-overlapping 2×2 blocks and solve the mode within each of them to generate the downsampled image. However, we must still deal with odd images, where the edge is not perfectly covered by a 2×2 block.
Luckily, there is a simple solution. For any odd image, mirror the edge to generate an even image. There are two cases: a corner (1×1) and an edge (2×1 or 1×2). Mirroring a corner will generate case 1(a), which will lead to that same pixel being drawn. Mirroring an edge will lead to either case 1(a) if the pixels are the same or else 1(c), both of which will be handled correctly by COUNTLESS.
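In numpy, mirroring the trailing edge is a one-liner with np.pad (a sketch; `mirror_to_even` is a hypothetical helper name):

```python
import numpy as np

def mirror_to_even(img):
    """Duplicate the last row and/or column as needed so both dimensions are even."""
    pad_y = img.shape[0] % 2
    pad_x = img.shape[1] % 2
    return np.pad(img, ((0, pad_y), (0, pad_x)), mode='edge')

img = np.arange(9).reshape(3, 3)  # 3x3: both dimensions odd
print(mirror_to_even(img).shape)  # (4, 4)
```

Even-dimensioned inputs pass through unchanged, since both pad amounts are zero.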
Implementation

Implementation of COUNTLESS in Numpy is straightforward. First the image must be divided into a covering of 2×2 blocks. This is represented by creating four 2D arrays, a, b, c, & d, each conceptually representing its eponymous pixel from Figure 1, but executed in parallel across every 2×2 block in the image. This is accomplished by striding by (2,2) at offsets (0,0), (0,1), (1,0), and (1,1) from the upper-left corner. Next, we compute the result using COUNTLESS. Numpy arrays do not support the short-circuiting logical OR operator, but they do support bitwise OR. Luckily, according to Table 1, the values resulting from PICK(A,B), PICK(A,C), and PICK(B,C) will be either a single, possibly duplicated, value or zero. Therefore, in this special case, bitwise OR behaves the same as logical OR, saving us several operations that would otherwise be required by EQN. 4. Listing 1 shows the implementation:
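A minimal sketch consistent with the description above (even-dimensioned input, no zero-valued labels) looks like this:

```python
import numpy as np

def simplest_countless(data):
    """2x downsample by mode. Assumes even dimensions and no zero-valued labels."""
    # a, b, c, d: one array per position within each 2x2 block
    a = data[0::2, 0::2]
    b = data[1::2, 0::2]
    c = data[0::2, 1::2]
    d = data[1::2, 1::2]
    ab = a * (a == b)  # PICK(A,B)
    ac = a * (a == c)  # PICK(A,C)
    bc = b * (b == c)  # PICK(B,C)
    # All nonzero PICKs share the same value, so bitwise OR acts as logical OR.
    result = ab | ac | bc
    return result + (result == 0) * d  # EQN. 4: fall through to D

labels = np.array([
    [1, 1, 3, 3],
    [2, 3, 3, 4],
    [5, 6, 7, 7],
    [5, 6, 8, 9],
], dtype=np.uint8)
print(simplest_countless(labels))
# [[1 3]
#  [5 7]]
```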
This implementation will work for most cases, but it has an important failure mode. If the matching pixels are zeros, we’ll choose D by accident as the result will look the same as the last row in Table 1. Unfortunately, this problem can’t be completely eliminated when working with finite integer representations, but we can get very close.
The strategy is to add one to the image before executing COUNTLESS and subtract one after. This turns zeros into ones and makes the algorithm work correctly, but it causes an overflow for the maximum valued integer (255 for uint8s, 65,535 for uint16, et cetera). However, casting to the next largest data type before adding one eliminates the overflow effect (i.e. casting a uint8 array to uint16). On current hardware, this method is feasible up to uint64. The zero problem is completely eliminated for uint8, uint16, uint32, but not uint64. This means the algorithm will fail if your labels include 2⁶⁴-1 which is about 1.84 x 10¹⁹. For many uses, this should be acceptable. Cast back to the original data type after subtracting one. For coding schemes which treat a maximum uint64 as a special flag, simply change the offset sufficiently to account for it.
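A sketch of the zero-corrected variant, upcasting before the offset to dodge overflow (the upcast mapping here is an illustrative assumption; uint64 input would need the offset-adjustment trick described above):

```python
import numpy as np

def zero_corrected_countless(data):
    """2x downsample by mode that also handles zero-valued labels."""
    upcast = {np.uint8: np.uint16, np.uint16: np.uint32, np.uint32: np.uint64}
    orig_dtype = data.dtype
    # +1 shifts zeros out of the way; the wider dtype prevents overflow
    data = data.astype(upcast.get(data.dtype.type, np.uint64)) + 1
    a = data[0::2, 0::2]
    b = data[1::2, 0::2]
    c = data[0::2, 1::2]
    d = data[1::2, 1::2]
    result = (a * (a == b)) | (a * (a == c)) | (b * (b == c))
    result += (result == 0) * d
    return (result - 1).astype(orig_dtype)

print(zero_corrected_countless(np.array([[0, 0], [5, 6]], dtype=np.uint8)))
# [[0]]: the zero majority survives the offset round trip
```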
One last thing: we've added a few operations to account for the zero label, but that hurts performance. We can recover some of it by noticing that ab and ac both multiply by a; factoring out that shared multiplication removes an operation.
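The factoring folds PICK(A,B) and PICK(A,C) into a single multiplication by a (a sketch of the quick_countless idea, shown here without the zero correction):

```python
import numpy as np

def quick_countless(data):
    """simplest_countless with one multiplication factored out; assumes no zeros."""
    a = data[0::2, 0::2]
    b = data[1::2, 0::2]
    c = data[0::2, 1::2]
    d = data[1::2, 1::2]
    ab_ac = a * ((a == b) | (a == c))  # one multiply covers PICK(A,B) and PICK(A,C)
    bc = b * (b == c)                  # PICK(B,C)
    result = ab_ac | bc
    return result + (result == 0) * d

labels = np.array([
    [1, 1, 3, 3],
    [2, 3, 3, 4],
    [5, 6, 7, 7],
    [5, 6, 8, 9],
], dtype=np.uint8)
print(quick_countless(labels))
# [[1 3]
#  [5 7]]
```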
Performance
In order to ascertain how fast this algorithm is, a comparison suite was developed and run on a dual core 2.8 GHz, i7 Macbook Pro (circa 2014), 256 kB L2 cache, 4MB L3 cache. While the algorithm was developed for segmentation labels, ordinary photographs are included to demonstrate how the algorithms perform when the data aren’t nicely uniform. Max pooling, averaging, and striding were included for speed comparisons even though they are unsuitable for this task.
I used Python 3.6.2 with numpy-1.13.3 and clang-802.0.42 for the following experiments.
The tested algorithms:
- striding: Pick every other pixel.
- counting: Count the frequency of each label’s occurrence.
- simplest_countless: simplest_countless.py
- quick_countless: simplest_countless.py + algebraic simplification
- zero_corrected_countless: zero_corrected_countless.py
- countless: countless.py
- countless_if: Using if statements instead of fancy tricks
These algorithms, though inappropriate for handling segmentation, were also tested to provide a point of comparison with other image processing algorithms:
- downsample_with_averaging: Average pixels in the 2×2 window.
- downsample_with_max_pooling: Pick highest value pixel in 2×2 window.
- ndzoom: Use the scipy function ndimage.interpolation.zoom to downscale 2×2.
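For reference, the two simplest baselines can be sketched as follows (the Counter-based counting here is illustrative; the benchmarked implementation differs):

```python
import numpy as np
from collections import Counter

def striding(img):
    """Pick every other pixel; fast, but discards three quarters of the data."""
    return img[::2, ::2]

def counting(img):
    """Naive per-block frequency count -- the slow pure-Python baseline."""
    out = np.zeros((img.shape[0] // 2, img.shape[1] // 2), dtype=img.dtype)
    for y in range(0, img.shape[0], 2):
        for x in range(0, img.shape[1], 2):
            block = img[y:y+2, x:x+2].flatten()
            out[y // 2, x // 2] = Counter(block).most_common(1)[0][0]
    return out

img = np.array([[1, 1, 3, 3],
                [2, 1, 3, 4]], dtype=np.uint8)
print(counting(img))  # [[1 3]]
print(striding(img))  # [[1 3]] -- agrees here, but only by luck of the layout
```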
The code used to test the algorithms can be found here.
After this article was published Aleks Zlateski contributed a Knight’s Landing (KNL) vectorized version of bitwise COUNTLESS that is reported to have run at 1 GPx/sec, 4 GB/sec on random int32 data. It can be found on Github.
Trial 1 – Segmentation of Neural Tissue

RGB Segmentation is a 1024×1024 image of pixel labels assigned by a convolutional neural network looking, amusingly, at neural tissue. This is the kind of image COUNTLESS was designed for. Each pixel is an RGB triad that taken together represents a single unsigned integer. For example, (R,G,B) = (15, 1, 0) represents 271 (15 + 1 × 256).
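The packing of an RGB triad into a single integer label can be sketched as (a hypothetical helper, assuming channel-last uint8 input):

```python
import numpy as np

def rgb_to_label(img):
    """Interpret an (H, W, 3) uint8 RGB image as one little-endian integer per pixel."""
    img = img.astype(np.uint32)
    return img[..., 0] + img[..., 1] * 256 + img[..., 2] * 65536

pixel = np.array([[[15, 1, 0]]], dtype=np.uint8)
print(rgb_to_label(pixel))  # [[271]], i.e. 15 + 1*256 + 0*65536
```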
In Table 2, while it doesn’t fulfill our criteria of choosing the most frequent pixel, striding is clearly the speed demon. It seems to be (sensibly) limited only by memory bandwidth. Naïve counting runs at only 38 kPx/sec, meaning that it takes about 27.6 seconds to compute a single image.
Various versions of the COUNTLESS algorithm clock in across a wide range from 986 kPx/sec to 38.59 MPx/sec, beating counting handily. countless_if is actually a variation on the counting implementation that uses if statements to test whether two pixels match. The major differences in performance between the other COUNTLESS variants depend on whether zero is handled correctly (a 37% difference between countless and quick_countless and 39% between simplest_countless and zero_corrected_countless) and whether a multiplication is simplified away (a 13.8% speedup between simplest_countless and quick_countless and a 15.6% speedup between zero_corrected_countless and countless).
Comparing countless, the fastest comprehensive variant of the algorithm, with two other common approaches to downsampling, it comes out to be about 1.7x slower than averaging and 3.1x slower than max pooling. If we were to process images that do not contain zero as a valid pixel, the relative differences would be 1.3x and 2.3x slower respectively.


Trial 2 – Gray Segmentation

What is the effect of a three-channel memory layout on algorithm performance? The same battery of tests was run on a grayscale (i.e. single channel) version of the Trial 1 image. This trial is closer than Trial 1 to measuring performance on a real-world task, though we more commonly operate on uint16, uint32, and uint64 arrays than uint8.
The gray image has one third as many bytes per pixel as the RGB image. If the relationship were simply linear, one would expect the MB/sec figure to remain roughly constant with a threefold improvement in MPx/sec, but this is not the case. For simplest_countless, the MB/sec is about 17.4x faster in grayscale. This is likely due to the non-contiguous layout of the RGB channels in memory; the grayscale image benefits from more efficient memory access. The number of iterations was increased to one thousand for this trial so the experiments run for roughly the same length of time as Trial 1. Since counting and countless_if were already known to be slow, for convenience they were measured at five iterations, which still resulted in substantial wall clock time.
C implementations of counting, quick_countless, and countless_if were also tested. As expected, the C code beat Python by between about 2.9x for quick_countless and 1025x for countless_if on the MPx/sec measure. While it wasn't surprising to see quick_countless gain a large speed boost from a C implementation, the dramatic gains in countless_if were impressive, making it the winner at 3.12 GPx/sec.



Trial 3 – Gray Ice Cream Man (GICM)

Gray Ice Cream Man (GICM) is a relatively large DSLR photo converted to gray scale. There is significant dynamic range and blurring effects, making the image vary significantly from pixel to pixel. This means that branch prediction in the CPU pipeline will fail much more frequently than in Trial 2. The large size of the image makes each trial a more stable test as well as it allows more time for CPU performance to even out.
Again, in Table 4, striding is clearly the winner. Naive counting runs at only 44 kPx/sec, meaning that it takes about 171 seconds to compute a single image. Various versions of the countless algorithm clock in across a wide range from 2.4 MPx/sec to 594.7 MPx/sec, beating the counting algorithm handily.
The major differences in performance between the other COUNTLESS variants depend on whether zero is handled correctly (a 3.2x difference between countless and quick_countless and 3.2x between simplest_countless and zero_corrected_countless). The algebraic simplification accounts for a gain of 14.9% between simplest_countless and quick_countless, and 16.2% between zero_corrected_countless and countless.
Interestingly, quick_countless performed much better against downsample_with_averaging and downsample_with_max_pooling in this case compared with Gray Segmentation.
The C results here are very similar to Trial 2, though there are some interesting features to note. countless_if fell 617 MPx/sec (~20%), probably due to an increase in branch prediction failures on a non-homogeneous image. quick_countless remained steady within the margin of error (assessed qualitatively rather than measured). countless_if is much faster, but quick_countless is more predictable in its performance across different images, though the performance of countless_if does seem to be consistently above that of quick_countless on the tested images.



Discussion
A standard Python/numpy implementation of COUNTLESS represents a large performance gain over a naïve implementation of the counting approach and is comparable in performance to averaging and max pooling, simple approaches heavily used in the image processing community. In a production image processing pipeline in Seung Lab, we often process blocks of 64 images of size 2048×2048 for downsampling. It’s important that the processing time be comparable to the download time to ensure an efficient pipeline, and COUNTLESS does the job.
COUNTLESS has benefits in a C implementation as well. An optimized counting implementation was able to achieve 880 MPx/sec on GICM, about 1.48x faster than the fastest Python implementation of quick_countless. However, the fastest bitwise-operator-based C implementation of quick_countless achieved 1.9 GPx/sec, and the if-statement-based implementation countless_if achieved 2.5 GPx/sec, meaning a single core could process about nine 2048×2048×64 blocks per second versus about three using counting. For reference, a given dataset can consist of tens of thousands of these blocks or more.
It seems that, at least in C, the cleverness associated with bitwise operators might not be so useful; after all, simple if statements beat them. However, there is a technological niche where the bitwise operators win. We have demonstrated that COUNTLESS can be useful in Python/numpy, and the general method seems likely to succeed in other interpreted languages with vectorized operators. Bitwise COUNTLESS might also be worthwhile in MATLAB, Octave, R, and Julia, though Julia is a compiled language. The bitwise variant seems particularly well suited to GPU implementation, where if statements are very costly. While the circumstances would have to be fairly special for this to be practical, it seems possible to speed up the C implementation of bitwise COUNTLESS considerably with vectorized instructions if the input were rearranged using a Z-order curve. (After this article was published, a KNL vectorized implementation by Aleks Zlateski achieved 4 GB/sec speeds, maxing out memory bandwidth.)
The COUNTLESS algorithm allows for the rapid generation of 2x downsamples of segmentations based on the most frequent value. It's conceivable that this ability may be useful in various machine learning pipelines, possibly even within algorithms (though the situation would need to be peculiar to favor modes over max pooling). Since COUNTLESS is a general method for finding the mode of four numbers, there may even be other applications that aren't related to image processing.
COUNTLESS does have two disadvantages. The first is that while it can be used recursively, only the first iteration is guaranteed to be a mode of the original image; the mode of a 4×4 image might be different from that of four 2×2 images. The second is that bitwise COUNTLESS in Python (though not in C) requires much more memory than counting. The original image, a, b, c, d, the results of intermediate operations, and the final result must all be retained while the algorithm runs, resulting in at least a fourfold memory increase, whereas counting need only store a small constant number of integers beyond the original data. If the original data can be discarded after a, b, c, and d are generated, then only a threefold increase is required. It should be noted that countless_if likewise requires only a few extra integers.
In its maiden application, COUNTLESS was employed to recursively generate MIP maps of segmentations derived from large electron micrographs of brain tissue using Python/numpy. While only the first MIP level is guaranteed to be the mode, the generated MIP levels blend more nicely than with striding, which tends to march diagonally across the screen as new MIP levels load.
Is COUNTLESS new? It’s possible this image processing algorithm has been invented before, and the underlying math has almost certainly been used in other contexts like pure math. However, I have not been able to find a reference for it yet.
There are several potentially fruitful directions in which to extend the COUNTLESS algorithm. The first involves the problem of random images, and the second involves extending the algorithm to three dimensions to handle volumetric tissue images.
With respect to random images, looking back to case 1(e), we'll always pick the bottom right corner, which on random or pathological data could cause the same diagonal shifting effect as naïve striding. This artifact occurs because the logical OR falls through to d in both case 1(b) and case 1(e). It may be possible to separate 1(b) from 1(e) by adding a term (a != b != c != d) to account for 1(e); however, it's not immediately obvious how to parallelize a random choice of a, b, c, or d across all 2×2 blocks in an image. This change may also not be desirable because it makes the output non-deterministic.
With respect to volumetric images, since my lab works with 3D images of brain tissue, the question was raised as to whether this approach could be extended to a 2×2×2 cube (mode of eight). The simplest way to analyze this question is to consider whether we can take the mode of five integers without counting. In that case, if at least three pixels match, the matching pixels are guaranteed to be the mode. If no three match, it falls to whether two match, and if no two match, then any pixel is a candidate. It's clear that extending this approach requires a combinatorial explosion in the number of comparisons that need to be made. While it's conceivable that this can be made more efficient than counting for five numbers, there are rapidly diminishing returns.
In a way, if-statement-based COUNTLESS is a kind of pre-literate algorithm that would have been used if no one had ever learned how to count. It seems clear that attempting to outcompete counting in C for modes of large numbers will be difficult. However, in Python, quick_countless has an edge of 5,263x over counting on GICM, implying that there is a lot of room for improvement even with substantial inefficiencies in the 3D case. An early demonstration suggests that 3D COUNTLESS may run at about 4 Megavoxels/sec in Python/numpy, about 35x faster than 2D counting. There will be more experiments to come.
Edit Feb. 14, 2018: In an upcoming article on COUNTLESS 3D, I will document speeds up to 24.9 MVx/sec using Python3/numpy.
Edit Feb. 20, 2018: The COUNTLESS 3D article is now out.
Edit June 21, 2018: If you’d like to use COUNTLESS 2D on sparse data without turning the upper downsamples black, try Stippled COUNTLESS 2D.
Acknowledgments
Several people contributed helpful advice and assistance in developing COUNTLESS. Chris Jordan provided the seed of the C implementation of counting and countless. Dr. Aleks Zlateski contributed a Knight's Landing SIMD version after this article was published. The ndzoom benchmark was contributed by Davit Buniatyan. Dr. George Nagy suggested testing countless_if and measuring the performance differential on homogeneous and non-homogeneous images. Jeremy Maitin-Shepard at Google originally developed the Python code for striding and downsample_with_averaging for use with neuroglancer. Special thanks to Seung Lab for providing neural segmentation labels.
Code
The code used for testing this pipeline can be found on github. Contributions in additional languages and feedback are welcomed and will be credited.
March 11, 2019: Now available as part of the tinybrain PyPI package.
Updates & Errata
Dec. 10, 2019: The reason striding is so fast is because the operation as tested is only updating the internal striding numbers of the numpy array; it’s not actually making a copy of the array. These experiments should be rerun to reflect this.
July 9, 2018: Found a way to eliminate the variable bc and reuse ab_ac for a small speedup (~2%?). Check github for this update.
Feb. 14, 2018: Updated charts and text with an updated benchmark of the Python code, now using Python 3.6.2. Tables 2, 3, & 4 and Figures 5, 7, & 10 have been replaced. The benchmark now includes code for the scipy function ndimage.interpolation.zoom contributed by Davit Buniatyan.
Feb. 1, 2018: I discovered an error in the Python benchmarking code that caused the speeds of most algorithms to be underestimated by a factor of four. I will be updating this article soon with new results. I’ve found a way to make the countless implementation faster and more memory efficient. Python3 is also faster than Python2. See Github for the latest.
Aug. 21, 2017: After this article was published Aleks Zlateski contributed a Knight’s Landing (KNL) vectorized version of bitwise COUNTLESS that is reported to have run at 1 GPx/sec, 4 GB/sec on random int32 data. It can be found on Github.