Image Analysis: Dynamic Color Binning
A while back, I wrote about a little project in which I extracted the historic surface area of Poland from Wikipedia maps using color information. The main procedure was to use ImageMagick to extract the most dominant colors, so that I could find the one that corresponded to the area of interest (i.e. Poland). This color, however, was not perfectly uniform, so I also extracted very similar colors and treated them as identical to the main color; such near-duplicates probably arise from JPEG artifacts and other imperfections.
I did that manually and approximately because the dataset was not very large, but by using OpenCV to analyze the images, we can automate this procedure, and perform a more systematic search for similar colors.
OpenCV is a nice image manipulation package for Python, and it comes with a function that will give us a color histogram, just like ImageMagick, so that part is taken care of. Now we want to scan this histogram for similar colors and group them. What kind of algorithm should we use?
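As a minimal sketch of that first step: OpenCV reads an image into a numpy array, and from there a per-color pixel count is a one-liner with numpy. (The function name `color_histogram` is my own; the post doesn't specify which OpenCV call it uses, so this counts exact colors directly with `np.unique` rather than `cv2.calcHist`, which only gives per-channel histograms.)

```python
import numpy as np

def color_histogram(image):
    """Per-color pixel counts for an H x W x 3 uint8 image.

    OpenCV (cv2.imread) returns exactly such an array, in BGR channel
    order, so each key below is a (B, G, R) tuple.
    """
    pixels = image.reshape(-1, 3)                  # flatten to a list of pixels
    colors, counts = np.unique(pixels, axis=0, return_counts=True)
    return [(tuple(int(c) for c in color), int(n))
            for color, n in zip(colors, counts)]
```

With a real image you would call `color_histogram(cv2.imread("map.png"))`.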
What we want is akin to posterizing. We want to simplify the colors of the picture to take on fewer values, lumping together anything very similar. However, posterizing divides the color space into fixed bins. That means that the divisions between color bins are not adapted to our image, and colors that we think should be binned together might end up in different bins.
This problem can be seen in the image above. In the original (left) we can hardly tell that the blue-grey area is not one color, but it is not perfectly homogeneous. The posterized version (right) has one of the colors dropped into the wrong bin, causing the speckles.
We want to do something smarter. Rather than combine colors into fixed bins, we want to create bins that are centered around the dominant colors in the picture. An image like a map, which only uses a few colors, has just a couple of sharp peaks in color space, as can be seen in the following histogram (for the blue channel).
Each peak is surrounded by small tails of similar colors. What we want is to take these peaks, which represent the dominant colors in the image, and project the surrounding tails onto those peaks.
To do that, we sort the image’s color histogram to put the most prominent color at the top. For example:
[[(191, 255, 255), 221623.0], [(128, 164, 87), 95261.0], [(241, 181, 127), 70964.0], [(19, 164, 102), 37208.0], [(155, 104, 111), 22599.0], [(190, 254, 254), 8777.0], [(127, 163, 86), 7487.0], [(193, 255, 255), 3896.0], [(0, 0, 0), 3467.0], [(190, 255, 255), 3448.0], ...
The tuples represent Blue-Green-Red values (OpenCV uses BGR rather than RGB internally) and the numbers after are the pixel count.
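Producing that sorted list from a `(color, count)` histogram is straightforward; a sketch (the helper name is mine):

```python
def sort_histogram(histogram):
    """Order (color, pixel_count) pairs from most to least dominant."""
    return sorted(histogram, key=lambda pair: pair[1], reverse=True)
```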
Now we loop through this sorted list of colors, and do the following: taking a dominant color, we loop through all colors that are below it (i.e. less dominant) and if they are similar, we add their pixel count to the count of the dominant color, and discard the similar color. I’ll dub this algorithm Dynamic Color Binning. In pseudo-code (Python-style) it looks like this:
for color1 in sorted_color_histogram:
    for color2 in colors_less_dominant_than(color1):
        delta = distance_in_Lab_space(color1, color2)
        if delta < r:
            pixels(color1) += pixels(color2)
            sorted_color_histogram.remove(color2)
The free parameter r is the radius of the bin in colorspace. I'm using the L*a*b* colorspace rather than plain RGB, because it was designed with the perceptual properties of the human eye in mind: colors that are the same distance apart in L*a*b* space are (approximately) equally distinct to human observers. In Python, classes and methods for dealing with colors and colorspaces are conveniently provided by third-party packages such as colormath.
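A runnable sketch of Dynamic Color Binning, under some assumptions of mine: I convert colors to L*a*b* with a hand-rolled sRGB-to-Lab conversion (D65 white point) to keep the example dependency-free, and use plain Euclidean distance in Lab space as the delta; a library such as colormath would offer more refined delta-E metrics. All function names are hypothetical.

```python
import numpy as np

def srgb_to_lab(bgr):
    """Convert one 0-255 BGR triple to CIE L*a*b* (D65 white point)."""
    b, g, r = (c / 255.0 for c in bgr)
    def lin(c):  # undo sRGB gamma
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = lin(r), lin(g), lin(b)
    # sRGB -> XYZ, normalized by the D65 reference white
    x = (0.4124 * r + 0.3576 * g + 0.1805 * b) / 0.95047
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = (0.0193 * r + 0.1192 * g + 0.9505 * b) / 1.08883
    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x), f(y), f(z)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def dynamic_color_binning(sorted_hist, radius):
    """Fold similar colors into their most dominant neighbor.

    sorted_hist: list of ((B, G, R), pixel_count), most dominant first.
    radius: bin radius in L*a*b* space.
    Returns (reduced_hist, synonyms), where synonyms maps each absorbed
    minor color to the dominant color that swallowed it.
    """
    labs = {color: srgb_to_lab(color) for color, _ in sorted_hist}
    reduced, synonyms, absorbed = [], {}, set()
    for i, (color1, count1) in enumerate(sorted_hist):
        if color1 in absorbed:
            continue
        total = count1
        for color2, count2 in sorted_hist[i + 1:]:  # only less dominant colors
            if color2 in absorbed:
                continue
            delta = np.linalg.norm(np.subtract(labs[color1], labs[color2]))
            if delta < radius:
                total += count2          # credit the pixels to the dominant color
                absorbed.add(color2)
                synonyms[color2] = color1
        reduced.append((color1, total))
    return reduced, synonyms
```

The `synonyms` dictionary is a small extra bookkeeping step beyond the pseudo-code, but it comes in handy below when cleaning the image itself.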
We end up with a reduced list containing just the dominant colors, each with a pixel count that includes its synonymous minor colors: exactly the data we need for our surface area analysis. So essentially we are done, but let's examine what our algorithm is doing.
It is not present in the pseudo-code above, but we can keep track of which colors we have grouped together. We can then clean the image by replacing all the minor colors with their major synonym. Because OpenCV simply reads the image into a numpy array representing the pixels, it is easy to replace colors. Visually, it’s hard to see any difference (as it should be) but we can see the result in the image histograms.
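That replacement step can be sketched with numpy alone; here `synonyms` is assumed to be a dict mapping each minor BGR tuple to its dominant synonym (the kind of bookkeeping the binning step can produce), and the function name is hypothetical:

```python
import numpy as np

def clean_image(image, synonyms):
    """Replace every minor color with its dominant synonym.

    image: H x W x 3 uint8 array (as returned by cv2.imread).
    synonyms: dict mapping minor (B, G, R) tuple -> dominant (B, G, R) tuple.
    Returns a cleaned copy; the original array is left untouched.
    """
    cleaned = image.copy()
    for minor, major in synonyms.items():
        # Boolean H x W mask of pixels that match the minor color exactly
        mask = np.all(cleaned == np.array(minor, dtype=np.uint8), axis=-1)
        cleaned[mask] = major
    return cleaned
```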
The tails are gone, and just clean peaks are left. If we look at the list of dominant colors, we can see that it has been cleaned up, as can be seen in the next image. Previously, we found almost identical colors in the top 10 most dominant colors; these have now all been grouped under single headers.
So there you have it! We have devised an image processing algorithm suitable for cleaning up images, such as maps, that contain only a small number of (different) colors. Because this is now automated, we can easily scale this up to multiple images, like we did for the Poland maps, and automate the analysis of a time series. We’ll look at that next time.