Daniel Medeiros

Doctorand @ KTH
Data Engineer / DevOps


Color Quantization with GDAL

One of the works the Geophysicist usually do at the Geological Survey of Brazil is the correlation between geophysics and geology for geological mapping. This task is done through the use of gammaspectrometric data - and, since its penetration (skin depth) is very low, the data is deeply correlated to the surface geology. By assigning colors to Potassium, Thorium and Uranium values, it is possible to generate what is known as ‘ternary map’ - pretty much a geological map. As this map is usually shown in RGB, the range of colors is about 16 million - and this is a issue for interpretation. In this post, we discuss a method to reduce it to 27 colors.

First, why 27 colors? Because it’s what is done at the GSB. We assign a RGB color to each element (i.e.: Red for Potassium, Green for Thorium and Blue for Uranium) and assign a reference index of 1 for ‘low’ values (color index: 0), 2 for ‘medium’ (color index: 128) and 3 for ‘high’ (color index: 255). The combination of those values - from 111 to 333 - is simply 27. Thus it’s in our interest to separate what’s low from high in order to reduce subjectivity on data interpretation.

The most common approach for this problem is using a unsupervised learning method such as K-Means. The issue is that, depending on your input dataset, the results in terms of colors will not be equalized. Ideally, in a RGB world, where colors range from 0 to 255, a ‘low’ index would mean pixels between 0 to 85, ‘medium’ is 86 to 170 and ‘high’ is 170 onwards. If your dataset does not contain this range of colors, pretty much the result will follow another pattern. The same happens even with supervised learning methods, where you’d need some degree of precision in order to manually train the machine to assign colors - and it may end up being either overfitted or too deterministic.

In the end, the best solution for this problem is simply through a heuristic. The GDAL python binding helps to fasten the development, where we simply open the bands (for RGB, there are 3 bands) and loop through the pixels checking for the condition for rasters - this is done 27 times, one for each combination. GDAL copies the original tiff georeferencing, flushes to the disk drive and that’s it! The result is simply seen as below. Please note that some white dots within the image are pretty much the bugs when I converted it to GIF - but nonetheless, the result is good.

I believe the next step for a better solution is training a convolutional neural network, which probably will end reducing the execution time for problem(in my laptop, the heuristic takes between 10 to 15 minutes for a 40mb file).

_Note: Whenever I imported the tiff rasters to ArcGIS, it would badly display the data due to some statistical it automatically does. This is solvable by simply disabling the ‘gamma stretch’ and the equalization to ‘none’ (properties tab of the raster).