in Computer Science

Python/Julia Packages for Scientific Computing in Geosciences

The best thing that has ever happened to Python is NumPy. Python is an easy language to learn overall and NumPy – written in good old C – allows everything to run in a seamless way, and this helped a lot to make Python rise as a scientific language for fields like image processing, machine learning and remote sensing.

On the other hand, Julia is a newer language which was in almost-perpetual beta until one or two years ago. Unlike Python, its entire standard library is written in Julia. This allows Julia to run even faster than NumPy but without the worries of excessive code optimizations (e.g. operation broadcasting/vectoring or even if your NumPy is using MKL or OpenBLAS). Most benchmarks compare Julia against a heavily optimized+compiled Python code (via Nuitka or Cython) and, well, if you’re going so far to write good code, it’s better to change engines.

The issues of Julia are that it is too new (so many packages are not matured enough or are barely maintained, unlike Python) and part of the design is a new concept for those used to traditional object-oriented programming – in particular, their implementation of the Multiple Dispatch methods is awesome. Finally, Julia has a slow startup time for compiling the actual code before running – and that might be a hassle for some types of applications.

But this post is not about the comparison of Julia and Python. Rather, it’s an actual rant of how I saw nearly three or four “geo”programming courses last week (mostly publicized on LinkedIn and Facebook) and none of them included actual important packages for geoscientists in its syllabus.

I’ll list my tackle on this.

Data Crunching/Processing

If you’re into scientific computing, you’ll most likely be analyzing (reading, processing) a lot of data all the time. In Python, learning NumPy is nearly essential for any job to come. Understanding the concept of arrays, shapes, dtypes (e.g.: np.int32, np.float64) as well as the concept of broadcasting (so you don’t needlessly loop on slow python bytecode) is pretty much very essential. NumPy’s builtin functions (matrix multiplication, percentile, standard deviation) are handy most of the time. In Julia, simply understanding the core language is pretty much enough – all Julia functions are written in Julia, after all.

Since most of the data we work are structured (well, most of the data I work is structured), DataFrames.jl for Julia and Pandas for Python are the way to go here. I personally like the comparison of those two libraries with SQL: it allows you to select lines based in conditions, create new columns and, with the aid with a mathematical framework, operate the data. Inputing/Merging/Exporting are also handled by those frameworks (although importing for DataFrames.jl may require other lib such as CSV.jl).

Finally, the issue with both Dataframes.jl and Pandas is that they fit the entire dataset into the available memory of the machine. This might not be good if you’re working with a big dataset – even one with a few gigabytes might be an issue. Plus, they’re not scalable/distributed as well. For this, there’s Dask (Python, built over Pandas) and JuliaDB to the rescue. In particular, these two should be the standard for data scientists, although both of them introduces some degree of complexity.

Image Plotting

Matplotlib is Python’s leader here, and probably the most widespread plotting tool in Python. From lines to surfaces, from histograms to images, it pretty much does everything while also allowing a high level control of the plotting options (labels, dpi, axis, scale, etc). Seaborn is built atop Matplotlib, and some say the outputs are usually more beautiful – but sometimes you’ll need to fallback to Matplotlib due to the lack of finer-grained control in Seaborn. Both are easy to use and Pandas/Dask-compatible.

In Julia, JuliaPlots allows the selection of plenty of rendering engines. In particular, GR and Plotly (the Javascript library). This allows a very high level of customization and interactivity, maybe even surpassing Matplotlib in that regard. One line is also enough to generate a basic plot.

Image Processing

While the gold standard for image processing is OpenCV, I feel that its Python API is actually pretty outdated (or badly documented). That’s why I prefer scikit-image: it’s mature, very well documented and seamless integrated to other libraries like scikit-learn. The number of ready-to-use functions is also astounishing: from grayscale to thresholding, from radon transform to deconvolutions – and each function is pretty much highly customizable.

JuliaImages is the equivalent of scikit-image to Julia. Although very poorly documented, it includes as many out-of-the-box functions (e.g.: segmentation, filtering) as its Python counterpart – with the advantage of being written in Julia (so no worries with pixel-by-pixel processing).

Statistics / Optimizations

Disclaimer: I am not familiar with most Optimizations libraries in Julia.

I’ve used Scipy a lot of times for i) interpolations, ii) Fourier transforms and iii) linear algebra (mainly inverses matrix and eigenvalues). It’s from the same team as Numpy, so it’s a highly optimized code. You can also use special functions (e.g.: Bessel’s) but you won’t unless you’re from a very specific field.

Julia has a lot of linear algebra already built-in as the standard library (even with some LAPACK/BLAS compatibility layer). For interpolations (if you’re a geoscientist, you’ll use a lot), I’ve tried Interpolations.jl and GridInterpolations.jl – they work like a charm. Overall, the JuliaMath package works supposedly similar to Scipy, and JuliaOpt, which I did not test, has some interesting optimization functions.

Machine Learning

A lot of modules already includes this by default (OpenCV is an example). However, whenever we speak about dedicated libraries for machine learning, we have scikit-learn for Python, where it includes a lot of preprocessing pipelines (random sampling, one-hot encoding) and a lot of traditional machine learning methods (e.g.: Random Forest, Support Vector Machines/Regressor, etc). It’s really well documented and very easy to use – if you have the preprocessed data, you can build a ML model in three lines. And evaluate the model in a few more. However, it’s easiness makes it a bit more limited when handling neural networks – in particular, its Multilayer Perceptron interface is too simple for complex models.

Some people like to mathematically implement their code – both forward and back propagation and aux functions in a neural network, for example. Theano (Python) exists for this purpose – computing gradients and jacobians of matrix is a nobrainer here.

For deep-learning focused libraries, there’s PyTorch and Google’s Tensorflow. I never used the former (although I know it is loved by some), but I have some experience with the latter – in particular, its Keras API makes the design of multilayer perceptrons very easy; choosing between optimizers, metrics, losses and evaluating them are pretty straight forward. It also supports Convolutional layers for images.

On Julia side, we have MLJ as the equivalent for scikit-learn (it even has the compatibility layer with scikit-learn), but it actually lacks the preprocessing stuff and is a bit immature. Flux.jl and Knet.jl are similar to Tensorflow, while some praise the latter for being faster than the former. I personally think as Flux.jl having a simpler syntax for begginers.

There’s also an unofficial Tensorflow wrapper for Julia called Tensorflow.jl, but last time I checked there was very low activity in the project – plus, the Tensorflow team decided to create a new official wrapper for… Apple’s Swift. Why? Because it “has a much larger community, is syntactically closer to Python, and because we were more familiar with its internal implementation details – which allowed us to implement a prototype much faster“. At this point, I can’t be bothered with the Julia wrapper anymore.

Geo-spatial Libraries

This is the main purpose of this post. Probably all world-class geoscience softwares uses GDAL as engine (maybe except ESRI’s ArcGIS). GDAL is originally written in C++, but it has some wrappers for Python and Julia. GDAL by itself is not really friendly, although using some functions such as GetRasterBand (which is very handy for hyperspectral images) and Polygonyze for autocreating shapefiles. In Python, it’s common to use GDAL alongside OGR/OSR libs to work with projections, while in Julia we can use the GDAL.jl wrapper for everything – however, a more julian syntax can be found on ArchGDAL.jl, which can reduce some good lines of code.

Since GDAL is not high-level enough for most people, the Python ecossystem goes further with its tools: fiona, rasterio, pyproj and shapely helps working with geometries, projections and rasters in a even more simplified but powerful way. The equivalent for those in Julia is probably LibGEOS.jl and Shapefile.jl, but I didn’t use those for long. I don’t know if Julia has an equivalent to Python’s GeoPandas, which can actually lessen headaches when working with coordinates.

(Added 15/05/2020) I’ve taken note that there’s the Orfeo Toolbox (developed by France’s CNES) with its Python API which was made exclusively for Remote Sensing. There’s no equivalent for Julia.

Of course, since Python has been for a long time, its easier to find documentation and answers in StackOverflow – and that’s a very plus for begginners. However, I’m really in favour of teaching at least the basics of GDAL for most people – while you can use scikit-image or OpenCV for processing hyperspectral images, you’d need something to write the GTiff header in order to project it, and that’s when the basic knowledge of GDAL (or fiona/pyproj) shines. From those courses I described right at the header of the post, none cared about it – and, well, it’s a problem for the future.

Note: Julia is usually compatible with Python via PyCall, however I didn’t research the benchmark impacts of using Python libraries instead of the native ones – hence, I’m avoiding this discussion in this post.

Note 2: In particular, Python’s ecossystem is huge. This post covers the really basic and most general libraries for beginners. If you want to work with Natural Language Processing, there’s NLTK. Sometimes you need to obtain data from web, so Selenium or Scrapy is handy here. For web frameworks, there’s Flask/Django in Python and Genie.jl in Julia (tried all of them, they’re awesome). And so it goes.