
The Search Dilemma – Part 4

Note: This part produced a few hundred gigabytes of experimental data. I'll be showing only the important points. Also, the next part will probably be the last one.

We spoke about Elasticsearch in the first three posts of this series. However, instrumenting the Java Virtual Machine introduced a big overhead – on top of the overhead caused by thread migrations between big and LITTLE cores – so we opted to change both the benchmarking engine and the processor type.

Hence, we started working with a version of Xapian modified by TailBench [1], a benchmarking suite. Xapian is the actual search engine core, written in C++, while TailBench provides a simplified API for configuring both the server and the client (which also generates the requests). Finally, we moved from big.LITTLE/AMP processors to DVFS-capable ones.

Since we moved to a dynamic voltage and frequency scaling (DVFS) processor, we actually had to configure our Linux distro for it. We had to recompile the distro kernel to 5.0 or higher (why? because that's how Linux works – in this case, it was due to some BPF features), install the acpi-cpufreq driver and some utilities such as cpufreq-utils.

The ACPI driver is important because it gives the user control of the frequency at per-core granularity, while Intel P-State doesn't (at least as far as I know). There are six standard 'governors': userspace, performance, conservative, schedutil, ondemand and powersave. They are pretty straightforward: userspace runs at a constant frequency set by the user, performance stays at the maximum processor frequency all the time, while powersave does the inverse. Ondemand, conservative and schedutil are gradual governors – in particular, ondemand watches CPU usage and, past a certain threshold, jumps to the maximum processor frequency and then starts scaling down.
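From user space, picking a governor is just a write to the standard cpufreq sysfs file. A minimal sketch (the helper names are ours; it needs root and a driver, such as acpi-cpufreq, that exposes per-core policies):

```cpp
#include <fstream>
#include <string>

// sysfs path of the cpufreq governor file for a given core.
std::string governor_path(int cpu) {
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
           "/cpufreq/scaling_governor";
}

// Select a governor ("userspace", "ondemand", ...) for one core.
bool set_governor(int cpu, const std::string& gov) {
    std::ofstream f(governor_path(cpu));
    if (!f) return false;   // no cpufreq support, or no permission
    f << gov << '\n';
    return static_cast<bool>(f);
}
```

The currently available governors for a core can be checked by reading `scaling_available_governors` in the same directory.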

There are two good ways to change the CPU frequency. The first is via the cpufreq API, where we directly write the desired frequency to a file (/sys/devices/system/cpu/cpuX/cpufreq/scaling_setspeed) and wait about 10 µs for the transition. The second is writing directly to the model-specific registers (modprobe msr), which breaks the kernel's bookkeeping (i.e. you no longer know the real frequency via cpufreq) and probably the energy measurement; in exchange, the frequency transition has a lower cost.
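For the cpufreq route, the write itself is a one-liner. A sketch (helper names are ours; the userspace governor must be active for scaling_setspeed to accept writes):

```cpp
#include <fstream>
#include <string>

// sysfs file that accepts a target frequency when the 'userspace'
// governor is active on that core.
std::string setspeed_path(int cpu) {
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
           "/cpufreq/scaling_setspeed";
}

// Request a frequency in kHz for one core; the ~10 µs transition
// happens inside the kernel after the write.
bool set_frequency_khz(int cpu, long khz) {
    std::ofstream f(setspeed_path(cpu));
    if (!f) return false;   // wrong governor, no driver, or no root
    f << khz << '\n';
    return static_cast<bool>(f);
}
```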

Back to TailBench: it has two modes of operation, networked and integrated. The former runs the server and client as two separate processes (on the same machine or not) communicating via TCP/IP, while the latter runs in a single process and communicates through memory. It also ships a request generator, written in Java, known as the 'harness', which pretty much generates requests for all the benchmarks included in the suite.

As we did with Elasticsearch, we profiled Xapian, as seen in the image below. We used KCachegrind for it, and although it introduces a big overhead, we assume that the overhead is constant and applies roughly uniformly to the entire application. So we chose the processRequest() function from the call tree as our starting point.

Xapian profiled by KCachegrind.

Now we have to analyze what influences the total service time. Since we had control of the harness, we could change the length of the keywords it sent to Xapian – so we worked with low and high key lengths. This was the first variable we tried because it had actually shown a difference with Elasticsearch.

So we used cpufreq to sweep multiple frequencies and also ran multiple keyword lengths. The results below show that there is no direct correlation between key length and service time – but there is one with the operating frequency. This means that the processRequest() function actually does some degree of floating-point computation, not only I/O.
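A toy model of why frequency, and not key length, moves the needle: if processRequest() is compute-bound, its latency is roughly the work in cycles divided by the clock rate. The numbers below are made up for illustration:

```cpp
// Latency of a compute-bound function: cycles of work divided by
// clock rate. Megacycles / MHz conveniently yields milliseconds.
double est_latency_ms(double work_megacycles, double freq_mhz) {
    return work_megacycles / freq_mhz;
}
// e.g. 2400 megacycles of work: 2.0 ms at 1200 MHz, 1.0 ms at 2400 MHz.
```

If the function were I/O-bound instead, cutting the frequency in half would barely change its latency – which is exactly the behavior we did *not* observe.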

Another factor we had under control was the query-per-second (QPS) rate. We noticed that TailBench's networked version was not able to handle QPS values higher than 3000 – hence, only the integrated version was tested. And the results show that the higher the QPS, the higher the total service time.

But if the actual processing time does not change with the request key length, how does the QPS influence the service time? Simple: the number of incoming requests overwhelms the server side, so the waiting time keeps growing. Since the total service time is the actual processing time plus the waiting time, the total service time also increases.
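This waiting-time effect is easy to reproduce with a deterministic single-server queue – a sketch of the mechanism, not TailBench's actual queue:

```cpp
#include <algorithm>

// Deterministic single-server queue: requests arrive every 1/qps
// seconds and each needs 'service_s' seconds of processing.
// Returns waiting + processing time of the last of n requests.
double total_time_of_last(int n, double qps, double service_s) {
    double interarrival = 1.0 / qps;
    double total = 0.0, finish = 0.0;
    for (int i = 0; i < n; ++i) {
        double arrival = i * interarrival;
        double start = std::max(arrival, finish); // wait while the server is busy
        finish = start + service_s;
        total = finish - arrival;  // this request's total service time
    }
    return total;
}
```

With a 1 ms processing time, 500 QPS leaves the queue empty (total time stays at 1 ms), while 2000 QPS makes each request wait longer than the last – exactly the "bigger and bigger" waiting time described above.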

Finally, we had to establish a baseline. The governors discussed so far only look at CPU-level metrics, while we were approaching thread scheduling with application-level information. We benchmarked the schedutil, conservative and ondemand governors. Ondemand seemed the most appropriate, as it consistently kept under the 6 ms deadline and also showed a roughly linear energy consumption profile.

Speaking of which, measuring the energy consumption was tricky. Since we were using the integrated version – which runs everything in a single process – we pinned the server threads to socket 0, while the rest of the application started on socket 1. Intel's RAPL allowed us to measure energy per socket, so socket 0 was labeled as the server's consumption.
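Reading RAPL through the Linux powercap interface is just another sysfs read. A minimal sketch (the path layout is the standard intel-rapl one; the helper names are ours):

```cpp
#include <fstream>
#include <string>

// sysfs path of the powercap RAPL energy counter for one package
// (socket): package 0 maps to intel-rapl:0, package 1 to intel-rapl:1.
std::string rapl_energy_path(int package) {
    return "/sys/class/powercap/intel-rapl:" + std::to_string(package) +
           "/energy_uj";
}

// Read the cumulative energy counter in microjoules; take one reading
// before and one after an experiment and subtract.
long long read_energy_uj(int package) {
    std::ifstream f(rapl_energy_path(package));
    long long uj = -1;
    f >> uj;   // stays -1 if the file is missing or unreadable
    return uj;
}
```

One caveat: the counter wraps around at the value reported in `max_energy_range_uj`, so long-running experiments need wrap handling when subtracting readings.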

The first attempt at scheduling was fairly simple: run the entire application at a base frequency and instrument the hot function so that the frequency is raised while a thread executes that specific function. While the concept sounds simple, the implementation ended up being fairly difficult because, well, you don't need to raise the frequency in *all* situations – only when the QPS is very high. In the end, the energy consumption of this approach was higher than ondemand's, and measuring the queue size was harder than expected.
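The instrumentation idea can be sketched as an RAII guard around the hot function – hypothetical names, and the injected writer callback stands in for the actual sysfs frequency write:

```cpp
#include <functional>
#include <utility>

// Raise the core frequency on entry to the hot function and restore
// the base frequency on exit. Injecting the writer keeps the sketch
// testable without touching sysfs.
class FreqBoost {
    std::function<void(long)> write_khz_;
    long base_khz_;
public:
    FreqBoost(std::function<void(long)> write_khz, long base_khz, long boost_khz)
        : write_khz_(std::move(write_khz)), base_khz_(base_khz) {
        write_khz_(boost_khz);              // entering the hot path: go fast
    }
    ~FreqBoost() { write_khz_(base_khz_); } // leaving: back to base frequency
};
```

Inside processRequest() one would construct a FreqBoost at the top of the function, so the destructor restores the base frequency on every exit path – including early returns and exceptions.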

A lot of experiments went by, but we ended up wanting to prove how viable the idea was. We decided that the main factors affecting energy were the number of active cores, the processor frequency, the QPS and some parameters of the scheduler itself. We ran all possible combinations, producing three graphs similar to the one below (one for low QPS, one for mid QPS and one for high QPS). While I won't discuss the parameters here, the position of each triangle, its border color, its background color and its facing direction all mean something.

If you know all the parameters, you can beat ondemand on both energy consumption and service time for each QPS. The image below shows this, under the label "Hurryup Simulation" – if you pick the proper parameters according to your deadline, you get the optimal solution. Since the imposed deadline was 5 ms, we felt no need to finish some requests in 2 ms, and that allowed us to save energy by running at a lower frequency.

A more complex model could be built on top of this, but at that point TailBench's Xapian would have needed a complete rewrite to serve our purposes, since its processing model relies on a fixed number of threads – so we dropped the work with it. Back to Elasticsearch.

References:

[1] Harshad Kasture and Daniel Sanchez. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proceedings of the 2016 IEEE International Symposium on Workload Characterization (IISWC).