Daniel Medeiros


PhD student @ KTH
Data Engineer / DevOps


On 'Indexing Wikipedia in Elasticsearch'

After nearly a year working with Tailbench (and Xapian), I had to turn back to Elasticsearch. This happened because Tailbench/Xapian has a lot of limitations that would require a heavy code rewrite to bypass - and, well, I simply lacked the time to do it.

While I won’t revisit here the issues I had with the Java Virtual Machine (the ideas we used to circumvent those bottlenecks will come in a later post), this is essentially a post for anyone who runs into the same issues I did.

Web search happens to be a good benchmark for measuring the performance of task-scheduling algorithms, and Wikipedia ends up being a great dataset for this purpose. I had some discussions with my advisor about the indexing methods for it, and also about the shard size to use on a single node.

The first method is described on the Elasticsearch blog. It works, but the index for the entire English Wikipedia (e.g., enwiki-20200420-cirrussearch-content.json.gz) ends up being quite big - about 80 GB. On top of that, the wikimedia:search plugin needs to be compiled for the exact version of Elasticsearch in use - in this case, this means the most recent version usable for this process is ES 6.5.4.

It looks like the API changes in ES 7.0 break compatibility with this plugin (although I’m not sure about that). Aside from this issue, following the tutorial is pretty straightforward, and indexing takes about four to five hours.

Finally, the dump name is important. To use this method, the wikidump files must be the .json ones, since that is the format the CirrusSearch plugin indexes directly (these files are different from the ones used with the other two methods below!).
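For reference, the core of that tutorial boils down to splitting the CirrusSearch dump into small batches and feeding them to the _bulk API. Below is a minimal sketch of that loop, assuming the index (here called enwiki, my own choice) was already created with the settings from the tutorial; the 500-line chunk size is also just a reasonable guess.

export dump=enwiki-20200420-cirrussearch-content.json.gz
export index=enwiki

# The CirrusSearch dump is already in bulk format (one action line followed by
# one document line), so it only needs to be split into pieces small enough
# for a single bulk request.
zcat $dump | split -a 10 -l 500 - chunk_

for file in chunk_*; do
   curl -s -XPOST "http://localhost:9200/$index/_bulk" \
        -H 'Content-Type: application/x-ndjson' \
        --data-binary @"$file" > /dev/null
   rm "$file"
done

Each 500-line chunk corresponds to 250 documents per request, which keeps the bulk payloads comfortably small.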

The second and third methods are described on another blog. The former uses Logstash and the latter the stream2es application. I had already used stream2es back in 2018, and it worked with a few limitations: the dump had to be split into chunks, or it would eventually stop processing at some point. Plus, it is a very old application - most people I’ve seen use it with Elasticsearch 2.x.

Anyway, the best course of action here is to use the roughly 400 MB chunks already available at the Wikimedia dumps (e.g., enwiki-latest-pages-articles12.xml-p3926862p5040436.bz2). Without a shell script, fetching them all is a very boring task to do manually.
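As a sketch, such a script could look like the one below. The URL layout is the usual dumps.wikimedia.org listing and the stream2es flags are the ones I remember from its README, so treat both as assumptions and double-check against stream2es --help.

# Fetch every split 'pages-articles' chunk listed on the latest enwiki dump page
# and feed it to stream2es as it arrives.
base="https://dumps.wikimedia.org/enwiki/latest"
wget -q -O - "$base/" \
   | grep -o 'enwiki-latest-pages-articles[0-9]*\.xml-p[0-9]*p[0-9]*\.bz2' \
   | sort -u \
   | while read -r chunk; do
        wget -c "$base/$chunk"
        ./stream2es wiki --source "$chunk" --target "http://localhost:9200/enwiki"
     done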

Finally, there is indexing with Logstash. I just followed the tutorial, to be honest, and it worked fine. There were not many issues, except that it also takes about four hours to index, and the input files are the same ones used with stream2es (although there is no need to split them).

However, I noticed some differences: the first method gave me a total index of about 80 GB, while the Logstash one came to 40 GB. I probably indexed a lot more than I needed with the first method, while the other contains just the article contents.
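The quickest way to compare the two is the _cat API; the column list below is just the set I find useful, and the index names are whatever you chose when indexing.

# Document count and on-disk size (including replicas) for every index.
curl -s "http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size,pri.store.size"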

I also noticed some speed differences between these methods and tried to benchmark them. By default, Elasticsearch creates five shards and one replica (ten shards in total), probably so the index can scale out easily later. One of my arguments was that, with that many shards, the extra IO operations could be cancelling out the gains from parallelism.

The shard rescaling was done with the index template below, which forces every new index to a single primary shard with no replicas.

curl -XPOST "http://localhost:9200/_template/default" -H 'Content-Type: application/json' -d '{
   "index_patterns": ["*"],
   "order": -1,
   "settings": {
      "number_of_shards": "1",
      "number_of_replicas": "0"
   }
}'
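Before reindexing, it is worth double-checking that the template was registered and, once Logstash finishes, that the new index really ended up with a single shard. The index name pattern below is a placeholder for whatever you named the index.

# Read the template back to confirm the shard/replica settings took effect.
curl -s "http://localhost:9200/_template/default?pretty"

# After reindexing, list the shards actually allocated to the new index.
curl -s "http://localhost:9200/_cat/shards/enwiki*?v"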

Then I reindexed using Logstash and ran the index against a load generator (FABAN, with the Zipfian distribution provided by the Cloud Suite package). Since I already know there’s a difference between low and high key lengths, I filtered only the queries with length above 10. The image below outlines the results.

This just shows me that the extremes aren’t good: the 75 GB configuration takes too much time simply because it is too big, while the single-shard configuration also takes longer because, well, there is apparently no parallelism in its search operations - and that loss outweighs whatever it saves in IO operations compared to the 5x8 GB configuration.

As a disclaimer, I am not looking for the optimal shards-per-node cluster size; this is just the outcome of a technical discussion. I’ll stick with the 5x8 GB configuration for my next studies.