The Search Dilemma – Part 4

Note: This part produced a few hundred gigabytes of experimental data, so I'll be showing only the important points. Also, the next part is probably the last one.

We talked about Elasticsearch in the first three posts of this series. However, instrumenting the Java Virtual Machine introduced a big overhead, on top of the overhead caused by thread migrations between big and LITTLE cores, so we opted to change both the benchmarking engine and the processor type.

Hence, we started working with a version of Xapian modified by Tailbench[1], a benchmarking suite. Xapian is the actual search engine core, written in C++, while Tailbench provides a simplified API for configuring both the server and the client (and for generating requests). Finally, we moved from the big.LITTLE/AMP processors to processors with DVFS (dynamic voltage and frequency scaling).


Python/Julia Packages for Scientific Computing in Geosciences

The best thing that has ever happened to Python is NumPy. Python is an easy language to learn overall, and NumPy (written in good old C) lets everything run seamlessly. This helped a lot in making Python rise as a scientific language for fields like image processing, machine learning and remote sensing.

On the other hand, Julia is a newer language which was in almost-perpetual beta until one or two years ago. Unlike Python, almost all of its standard library is written in Julia itself. This allows Julia to run even faster than NumPy without the worries of excessive code optimization (e.g. broadcasting/vectorizing operations, or even whether your NumPy is using MKL or OpenBLAS). Most benchmarks compare Julia against heavily optimized and compiled Python code (via Nuitka or Cython) and, well, if you're going that far to write fast code, it's better to change engines.

Julia's issues are that it is still young (so many packages are immature or barely maintained, unlike Python's) and that part of its design is a new concept for those used to traditional object-oriented programming, although its implementation of multiple dispatch in particular is awesome. Finally, Julia has a slow startup because it compiles the code before running it, and that might be a hassle for some types of applications.

But this post is not about comparing Julia and Python. Rather, it's a rant about how I saw three or four "geo" programming courses last week (mostly publicized on LinkedIn and Facebook) and none of them included the packages that actually matter to geoscientists in their syllabus.

I'll list my take on this.


From .NET Framework to .NET 5

Note: .NET 5 is still in preview and does not officially support Visual Basic .NET Forms as of April 23, 2020.

It has been over a decade since I wrote my last code in Visual Basic 6. A bunch of friends from a community I was part of wanted to revive an old game engine written in Visual Basic .NET, and I accepted; hence I am currently toying with Elysium in my free time.

VB6 stopped being supported by Microsoft a long time ago. While Elysium by itself deserves a separate post, I wanted to make the server run on Linux so the team could create a default map together. Thus, I set out to convert my project from .NET Framework 4.5 to .NET 5 (which also implies moving to the new "Core" runtime).


On ‘Indexing Wikipedia at Elasticsearch’

After nearly a year working with Tailbench (and Xapian), I had to turn back to Elasticsearch. Tailbench/Xapian has a lot of limitations that would require a heavy code rewrite to bypass, and, well, I simply lacked the time to do it.

I won't go over the issues I had with the Java Virtual Machine again here (the ideas we had to circumvent the bottlenecks will come later). This is essentially a post for anyone who runs into the same issues I did.


Pattern Recognition for Images through Neural Nets

Note: The next two or three posts will have their code written in Python. After them, I'll be moving exclusively to Julia for performance reasons.

I was recently asked to evaluate whether there's any kind of relationship between the geophysical signatures of magnetic+radiometric data and iron mineralizations. The dataset for the study area is about 10 gigabytes, with nearly 40 million lines, which is too much for both my notebook and the personal server that runs this website's nginx.

But the main point here is that there aren't many iron-mineralized points to build a decent model. Actually, there were only 10 confirmed points for a really large area. Well, I decided to try anyway, but instead of using the actual measurement values, my solution was to resort to RGB pixels.


The Search Dilemma: Part 3

Note: The approach described in this post was tested more than a year ago, and we (my master's advisor and I) decided to drop it because the actual implementation ran into too many technical barriers (e.g. real-time profiling with low overhead). Most of what follows are insights I had later.

In the last post, we discussed a first approach to the hot function model: whenever a thread entered that zone, it would be promoted to the faster cores and, when exiting, demoted to the slower cores. The first results did not show any improvement in quality of service. Why?


The Search Dilemma: Part 2

One of the most interesting things I've come across during my master's years is how applications behave. Obviously, some applications are more amenable to code optimization than others, and those applications will most likely contain some CPU-intensive functions that may eventually turn into bottlenecks in a heavy-load production environment. One of my hypotheses for maintaining quality of service while reducing energy consumption consisted of identifying that specific hot function and monitoring threads: a thread on a core would have its operating frequency raised while executing that function and, after it exits the hot function, the operating frequency would be lowered again. The assumption is that the other functions do not need to execute as fast as the hot function, so running them at full speed just consumes more energy.
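
To make the idea concrete, here is a minimal Python sketch of the hot function model. It is only a simulation under stated assumptions: the frequency values, the `SimulatedGovernor` class and the `score_documents` function are hypothetical names made up for illustration, and nothing here touches real hardware or reflects the instrumentation actually used in the experiments.

```python
import functools
import threading

# Hypothetical frequency levels in MHz; real values depend on the processor.
LOW_FREQ_MHZ = 1200
HIGH_FREQ_MHZ = 2400


class SimulatedGovernor:
    """Tracks a pretend per-thread operating frequency (no hardware access)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._freq = {}   # thread id -> current simulated frequency in MHz
        self.log = []     # (thread name, event, frequency) tuples

    def set_frequency(self, event, freq_mhz):
        thread = threading.current_thread()
        with self._lock:
            self._freq[thread.ident] = freq_mhz
            self.log.append((thread.name, event, freq_mhz))

    def current(self):
        # Threads that never entered a hot function stay at the low level.
        with self._lock:
            return self._freq.get(threading.get_ident(), LOW_FREQ_MHZ)


governor = SimulatedGovernor()


def hot_function(func):
    """Raise the simulated frequency on entry and lower it again on exit."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        governor.set_frequency("enter " + func.__name__, HIGH_FREQ_MHZ)
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if the hot function raises, so no thread stays boosted.
            governor.set_frequency("exit " + func.__name__, LOW_FREQ_MHZ)

    return wrapper


@hot_function
def score_documents(terms):
    # Stand-in for a CPU-intensive scoring routine (assumed, not real code).
    return sum(len(term) for term in terms), governor.current()


def handle_query(query):
    terms = query.split()  # cheap work, stays at the low frequency
    score, freq_inside = score_documents(terms)
    return score, freq_inside, governor.current()
```

Calling `handle_query("search engine latency")` reports the high frequency while inside `score_documents` and the low one after it returns. The sketch ignores nested hot functions and, more importantly, the cost of detecting entry and exit in real time.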


The Search Dilemma: Part 1

Have you ever wondered what would happen if a search engine (like Google, Bing or Yahoo) took hours to answer your search queries? Well, neither have I. But I presume that most people would get angry and just stop using it. This assumption is corroborated by a 2009 study[R1] which revealed that a delay of 2 seconds in delivering search results may cut companies' revenue by over 4% per user; in other words, slow answers mean less cash flow.

Big companies have many ways to address this quality-of-service issue and reduce response time: the most obvious is simply deploying faster processors, more memory caches and faster networks for distributed computing. However, this approach is not really the most efficient, as there are financial (deploying more servers costs money) and spatial (your datacenter has limited space) constraints. Jeff Dean[R2] shows some ways to circumvent these constraints and maximize the system's efficiency while guaranteeing the same quality of service for all users. I'll discuss one of them here.
